Generalized belief propagation for probabilistic systems

ABSTRACT

A method determines approximate probabilities of states of a system represented by a model. The model includes nodes connected by links. Each node represents possible states of a corresponding part of the system, and each link represents statistical dependencies between possible states of related nodes. The nodes are grouped into arbitrary sized clusters such that every node is included in at least one cluster and each link is completely contained in at least one cluster. Messages, based on the arbitrary sized cluster, are defined. Each message has associated sets of source nodes and destination nodes, and a value and a rule depending on other messages and on selected links connecting the source nodes and destination nodes. The value of each message is updated until a termination condition is reached. When the termination condition is reached, approximate probabilities of the states of the system are determined from the values of the messages.

FIELD OF THE INVENTION

This invention relates generally to modeling probabilistic systems, andmore particularly, to modeling probabilistic systems using beliefpropagation in a Markov network.

BACKGROUND OF THE INVENTION

Computer models are frequently used to study the behavior of complexprobabilistic systems. When the systems contain many inter-dependentrandom variables, Markov networks are often used. In a Markov network,nodes of the network represent the possible states of a part of thesystem, and links between the nodes represent statistical dependenciesbetween the possible states of those nodes.

By the Hammersly-Clifford theorem, from the study of Markov networks,the probability of any set of states at the nodes of the network can bewritten as the product of compatibility functions between clusters ofnodes.

FIG. 1 shows a simple network with four nodes labeled a, b, c, and d.The links between the nodes represent the statistical dependenciesbetween the possible states of the nodes. For the case of pairwiseprobabilistic interactions between nodes of the network, the overalljoint probability of the system can be expressed as the product ofcompatibility functions for each linked pair of nodes:P(s _(a) , s _(b) , s _(c) , s _(d))=φ_(ab)(s _(a) , s _(b))φ_(bc)(s_(b) , s _(c))φ_(ca)(s _(c) , s _(a))φ_(bd)(s _(b) , s _(d)),  [1]where φ_(ab) is the compatibility matrix between nodes a and b, s_(a) isa random variable describing the state at node a, and similarly for theother nodes and variables.

Often, Markov networks for practical applications are very large. Forexample, an image acquired from a scene by a camera may be representedby a Markov network between all small neighboring patches, or evenpixels, of the acquired image. Similarly, the well known “travellingsalesman problem” can map onto a Markov network where the maximumprobability state corresponds to the shortest path of the salesman'sroute. This network has as many nodes as cities to be visited. In someMarkov networks, the nodes can represent measured input signals, such asvisual input data. Markov models are also extensively used in speechrecognition systems.

To analyze the probabilistic system modeled by a Markov network, onetypically wants to find the marginal probabilities of certain networkvariables of interest. (The “marginal probability” of a variablesignifies the probability of that variable ignoring, the state of anyother network variable.) For example, it may be useful to examine theprobability of a variable that represents an underlying explanation forsome measured data, such as the probability of particular words used tovocalize particular speech sounds. To find those probabilities, theMarkov network is marginalized over all the other variables in thenetwork. This gives the probability of the variable representing theexplanation, given the measured input data values. This marginalizationis thus a form of inference.

One may also want to find states of the nodes, which maximize thenetwork probabilities. For example, for the Markov network correspondingto the travelling salesman problem, it is desired to find the state ateach node which maximize the probability of the Markov network. Thesestates, which minimize the length of the salesman's route, are known asthe maximum a posteriori probability (MAP) states.

In the example of FIG. 1, it is possible to determine the marginalprobability P(s_(a)) of the variable at node a by summing the randomvalues at nodes b, c, and d: $\begin{matrix}{{P\left( s_{a} \right)} = {\sum\limits_{s_{b},s_{c},s_{d}}{{\phi_{cb}\left( {s_{a},s_{b}} \right)}{\phi_{bc}\left( {s_{b},s_{c}} \right)}{\phi_{ca}\left( {s_{c},s_{a}} \right)}{{\phi_{bd}\left( {s_{b},s_{d}} \right)}.}}}} & \lbrack 2\rbrack\end{matrix}$

In general, especially for large networks, these marginal probabilitiesare infeasible to determine directly. The joint sum over all possiblestates of all the nodes can be of too high a dimension to sumnumerically, particularly when the network has closed loops.

FIGS. 2 a-b show examples of Markov networks with many loops for whichit is difficult to find either the marginal probability at a node, orthe state of the node which maximizes the overall probability of theMarkov network. Both networks are in the form of lattices, which arecommonly used to describe the joint probabilities of variablesspatially, distributed over two dimensions. FIG. 2 a shows a rectangularlattice, and FIG. 2 b shows a triangular lattice. These type of latticenetworks are used to model many systems.

Techniques to approximate the marginal probabilities for such structuresare known, but these techniques are typically very slow. Simulatedannealing can be used, or Gibbs sampling, see Geman et al. “Stochasticrelaxation, Gibbs distribution, and the Bayesian restoration of images,”IEEE Pattern Analysis and Machine Intelligence, 6:721-741, 1984. Anotherclass of approximation techniques are variational methods, see Jordan,“Learning in graphical models,” MIT Press, 1998. However, these methodsrequire an appropriate class of variational approximation functions fora particular problem. It is not obvious which functions, out of allpossible ones, to use for the approximations.

For the special case of Markov networks that form chains or trees, thereis a local message-passing method that calculates the marginalprobabilities at each node, see Pearl, “Probabilistic reasoning inintelligent systems: networks of plausible inference,” Morgan Kaufmann,1988. The later method is now in widespread use, and is equivalent tothe “forward-backward” and Viterbi methods for solving one dimensionalHidden Markov Models (HMM), and to Kalman filters and theirgeneralization to trees, see Luettgen et al. in “Efficient multiscaleregularization with applications to the computation of optical flow,”IEEE Trans. Image Processing, 3(1):41-64, 1994. This message-passingmethod gives the exact marginal probabilities for any Markov networkthat does not have loops. This is referred to as the “standard” beliefpropagation, or message-passing method below.

Unfortunately, many Markov networks of practical interest do containloops. For example, an image, modeled as a Markov network of local imagepatches connected to their nearest neighbors, gives a lattice structuredMarkov network as shown in FIGS. 2 a-b, also called a Markov randomfield. This type of network contains many loops.

Another method for inference in Markov networks applies the localmessage-passing rules derived for trees and chains in a network, eventhough the network may contain loops, see Weiss, “Belief propagation andrevision in networks with loops,” Technical Report 1616, MIT AI Lab,1997. This is referred to as the “loopy” belief propagation method inthe description below, although it should be clearly understood that the“loopy” method is nothing more than the “standard” belief propagationmethod applied to a network with loops. When such a procedure converges,it can yield an approximate determination of the marginal probabilities.However, the loopy method sometimes gives too poor an approximation tothe marginal probabilities, and often does not even converge. In thelatter case, the approximation gives no single answer for the desiredmarginal probabilities.

Therefore, it is desired to provide a method for determining marginalprobabilities in Markov networks that is both relatively fast and moreaccurate than the loopy method. Furthermore, it is desired to provide amethod for networks with loops that converges more reliably than theprior art loopy belief propagation method.

SUMMARY OF THE INVENTION

The present invention provides a method for determining theprobabilities of nodes in a network model of a complex probabilisticsystem. More particularly, the method determines desired marginal ormaximum a posteriori probabilities in networks with loops. The methoduses a message-passing scheme, appropriate for networks with loops,which is more accurate and typically converges more reliably and infewer iterations than prior art loopy belief propagation methods.

The invention describes a class of procedures in which computationalcost and accuracy can be traded off against each other, allowing a userof the invention to select more computation for more accuracy in aparticular application of the invention.

The invention has two major advantages over the prior art loopy beliefpropagation method. First, the invented method normally gives much moreaccurate answers for the desired marginal probabilities. Second, theinvented method can converge to a single answer in cases where the loopymethod does not.

Instead of finding the marginal probability at a node, one embodiment ofthe invention finds the states at each node which approximately maximizethe probability of the entire network. Thus, the invention provides anovel way to approximate both the marginal probabilities and MAP statesin Markov networks.

Many Markov network problems of interest are known to be NP-hardproblems. A problem is NP-hard when it is intrinsically harder thanthose problems that can be solved by a Turing machine innondeterministic polynomial time. When a decision version of acombinatorial optimization problem belongs to the class of NP-completeproblems, which includes the traveling salesman problem described above,an optimization version is NP-hard. The invented method yields fast,approximate solutions for some of these very difficult optimizationproblems.

More particularly, the invention provides a method for determiningmarginal probabilities of states of a system represented by a model. Themodel includes nodes connected by links. Each node represents possiblestates of a corresponding part of the system, and each link representsstatistical dependencies between possible states of related nodes.

The nodes are grouped into arbitrary sized clusters such that every nodeis included in at least one cluster and each link is completelycontained in at least one cluster. Based on the arbitrary sized cluster,messages are defined. Each message has associated sets of source nodesand destination nodes, a value for each state of the destination nodesand a rule that depends on other messages and on selected linksconnecting the source nodes and destination nodes. The values of eachmessage are updated using the associated rule until a terminationcondition is reached, at which point the probabilities of the states ofthe system are determined from the values of the messages.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is graph of a network with four nodes and links;

FIG. 2 a is a graph of a square network;

FIG. 2 b is a graph of a triangular network;

FIG. 3 is a flow diagram of a method for propagating beliefs in anetwork according to a preferred embodiment of the invention;

FIGS. 4 a-c are graphs of nodes grouped into arbitrary sizedintersecting clusters;

FIG. 5 is a graph of a nine node network grouped into four intersectingclusters;

FIG. 6 is a flow diagram of detailed steps of the method according tothe invention;

FIG. 7 is a graph of a hierarchy of regions;

FIGS. 8 a-c are graphs of belief propagation for a rectangular network;

FIG. 9 is a graph of an identity for the belief propagation of FIGS. 8a-c;

FIG. 10 is a graph of another identity of the belief propagation of FIG.7;

FIGS. 11 a-c are graphs of belief propagation for a triangular network;

FIG. 12 is a graph of an identity for the belief propagation of FIGS. 11a-c;

FIG. 13 is a graph of another identity for the belief propagation ofFIGS. 11 a-c;

FIG. 14 is graph of a nine node network grouped into four clusters;

FIG. 15 a is a graph of the network of FIG. 14 with the nodes groupedinto four clusters;

FIG. 15 b is a graph of a super-node network showing standard messagepassing;

FIG. 15 c is a graph showing standard and normalized message passing;

FIG. 15 d is a graph of a network with nine super-nodes; and

FIG. 16 is a flow diagram of an alternative embodiment of the beliefpropagation method according to the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Introduction

Our invention builds on the “cluster variational method” for analyzingcooperative phenomena in statistical physics, see for example, thespecial issue in honor of R. Kikuchi, Progress in Theoretical PhysicsSupplements, vol. 115, 1994.

A physical system is often modeled in statistical mechanics as a Markovnetwork, where the states of the nodes in the network correspond to thephysical states of the particles or spins in the physical system. Thecluster variational method derives approximations to the “Gibbs freeenergy” of a physical system. We refer to these as “Kikuchiapproximations” to the Gibbs free energy, or the “Kikuchi free energy.”From a Kikuchi approximation to the Gibbs free energy, one can determineproperties of a physical system, such as magnetization, specific heat,or the critical temperature for a magnetic spin system.

The simplest Kikuchi approximation derived with the cluster variationalmethod corresponds to an approximation first proposed by Bethe, seeBethe, Proc. Roy. Soc. (London) A 150 (1935) 552. The loopy beliefpropagation method, when it converges, gives marginal probabilitiesidentical to those obtained from the Bethe approximation.

Solutions obtained from propagation rules according to our invention areequivalent to solving the appropriate physical system under thecorresponding Kikuchi approximation. The method of minimizing theKikuchi free energy directly will therefore give identical results toour method, but such a direct minimization will be extremely slow and beplagued by the problem of convergence to inappropriate local minimal inthe free energy surface. Thus, physics calculations made with morecomplicated Kikuchi approximations have typically only been tractablefor homogeneous models of physical systems. As an advantage, our beliefpropagation method is also practical for heterogeneous models ofphysical systems where the nodes of the model have arbitrary topologyand interdependencies.

Calculation of Marginal Probabilities by Kikuchi Free EnergyMinimization

As shown in FIG. 3, we begin with a network 300 representing a system ormodel. Typically, the network is constructed ahead of time by an“expert” in the art field in which the model operates. It is not our jobto construct the model; instead, we provide a method for using themodel, even if the model is heterogeneous and includes many loops. Forexample, in a medical diagnosis system, the nodes can represent clinicalfindings, physiological and psychological conditions, laboratoryresults, pathologies, etc. Here, the goal may be to use the model tomake the best (most probable) diagnosis given observations made for aparticular patient. In an error-correcting coding system, the nodesrepresent source bits or parity bits, and the links represent therequired relationships between the bits. Here, the goal may be to usethe model to decode a block of bits that has been corrupted by noise.

The model includes a plurality of nodes 301. A random variable s at eachnode takes on one of a discrete set of states, or it may take on acontinuum of states. The nodes are connected to others by links 302 toform the network model 300. Typically, a link connects two nodes,however higher-order links can connect more than two nodes. The networkmodel 300 can represent a magnetic spin system, a medical diagnosissystem, an image-processing system, a speech recognition system, agenetic regulatory network, a decoding system, or other complexreal-world systems.

By definition, for a Markov network with pair-wise statisticaldependencies between nodes, the probability of a particular assignmentof values s at the nodes is given by: $\begin{matrix}{{{P\left( {s_{1},s_{2},{\ldots\quad s_{N}}} \right)} = {\frac{1}{Z}{\prod\limits_{i,j}{{\phi_{ij}\left( {s_{i},s_{j}} \right)}{\prod\limits_{i}{\psi\left( s_{i} \right)}}}}}},} & \lbrack 3\rbrack\end{matrix}$where the first product runs over all the linked neighboring nodes, iand j.

The φ compatibility matrices represent the statistical dependenciesbetween the nodes, as represented by the links 302 in the network model300. For example, in a medical diagnosis system, one node mightrepresent a symptom, and the nodes that it is linked to are relatedsymptoms or diseases. The numerical values of the compatibility matricesφ, represent the strength of the relation between specific nodes. Forexample, in a medical diagnosis model, some symptoms might be stronglycorrelated with certain diseases, i.e. a strong statistical dependency,while other symptoms might only have weak statistical correlation withother symptoms or diseases.

For some systems, it may be desired to represent the statisticaldependencies between the states of the nodes by compatibility functions(“links”) that are higher-order tensors. For example, if the states ofnodes i, j, and k are mutually statistically dependent, but thosedependencies cannot be expressed in terms of pair-wise interactions,then we introduce the tensor φ_(ijk)(s_(i),s_(j),s_(k)) to representthose dependencies. The over-all probability of the system would thenalso include the product over all higher-order tensors as well. Themethod we describe here will also apply to systems with suchhigher-order links between nodes. All that is necessary is to includeterms that arise from the higher-order links in the “energy” and “regionenergies” that we describe below.

The ψ functions represent the “evidence” that each node is in anyparticular one of its states before we consider any of the other nodes.While the φ functions will normally never change, the ψ functions willtypically change from one use of the model to another. For example, inthe case of a medical diagnosis model, the ψ_(i)(s_(i)) functionrepresents, for a specific patient, the results of a given test. To usethe same model for a different patient, we use different evidence,represented through the ψ functions, but we do not change the φfunctions, which represent the relationships between different symptoms,diseases, tests, etc.

In equation [3], Z is a normalization constant that insures that the sumof the probabilities of all possible states of the overall system isequal to one.

In order to describe the “Kikuchi free energy” of the model, it ishelpful to define some additional functions, which are motivated bycommon conventions from statistical physics. We define the “bondstrength” J_(ij)(s_(i),s_(j)) between two nodes i and j by:φ_(ij)(s _(i) ,s _(j))=e ^(J) ^(ij) ^((s) ^(i) ^(,s) ^(j) ^()/T)  [4]where T is a parameter called the “temperature.” We define the “field”h_(i)(s_(i)) at the node i by:ψ_(i)(s _(i))=e ^(−h) ^(i) ^((s) ^(i) ^()/T)  [5]

In terms of these variables, we have: $\begin{matrix}{{{P\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)} = {\frac{1}{Z}e^{{- {E{({s_{1},s_{2},\ldots\quad,s_{N}})}}}/T}}},} & \lbrack 6\rbrack\end{matrix}$where E is the “energy”: $\begin{matrix}{{E\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)} = {{\sum\limits_{ij}{J_{ij}\left( {s_{i},s_{j}} \right)}} + {\sum\limits_{i}{{h_{i}\left( s_{i} \right)}.}}}} & \lbrack 7\rbrack\end{matrix}$

In statistical physics, equation [6], which is equivalent to theoriginal definition [3] of the Markov model, is known as “Boltzmann'sLaw,” and the normalization constant Z is called the “partitionfunction.” The partition function is defined explicitly by$\begin{matrix}{Z = {\sum\limits_{s_{1},s_{2},\ldots\quad,s_{N}}{e^{{- {E{({S_{1},S_{2},\ldots\quad,S_{N}})}}}/T}.}}} & \lbrack 8\rbrack\end{matrix}$

Following the conventions of statistical physics, we define a quantitycalled the “Gibbs free energy.” The Gibbs free energy G is a function ofthe probability distribution P(s₁,s₂, . . . ,s_(N)). By design, G isdefined so that if it is minimized as a function P, then Boltzmann's Lawis automatically obeyed. G is defined by: $\begin{matrix}\begin{matrix}{{G\left( {{P\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)},\gamma} \right)} = {U - {T\quad S} +}} \\{{\gamma\left\lbrack {1 - {\sum\limits_{s_{1},s_{2},\ldots\quad,s_{N}}{P\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)}}} \right\rbrack}.}\end{matrix} & \lbrack 9\rbrack\end{matrix}$

Here, U is the expected value of the energy, also called the “internalenergy”: $\begin{matrix}{U = {\sum\limits_{s_{1},s_{2},\ldots\quad,s_{N}}{{E\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)}{P\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)}}}} & \lbrack 10\rbrack\end{matrix}$and S is the entropy: $\begin{matrix}{S = {- {\sum\limits_{s_{1},s_{2},{\ldots\quad s_{N}}}{{P\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)}{{\ln\left( {P\left( {s_{1},s_{2},\ldots\quad,s_{N}} \right)} \right)}.}}}}} & \lbrack 11\rbrack\end{matrix}$

The last term in equation [9] is a Lagrange multiplier term designed toenforce the constraint that the sum of the probabilities over all statesis equal to one. In order to insure that this constraint is satisfied,we set the partial derivative of G with respect to the Lagrangemultiplier γ equal to zero.

If we differentiate the Gibbs free energy G with respect to theprobability distribution P and the Lagrange multiplier y, and set thosepartial derivatives equal to zero, then we recover Boltzmann's Law.Thus, to understand physical systems, physicists often try to calculateand differentiate the Gibbs free energy.

For a large system, it is intractable to calculate G exactly, so Bethe,Kikuchi, and many others have developed approximation schemes. In theseschemes, one determines exactly the Gibbs free energy over some smallclusters of nodes of the network, and then approximates the total Gibbsfree energy as a sum over many small clusters of nodes, correcting forover-counting in regions where the different clusters overlap. In theBethe approximation, the clusters are precisely those pairs of nodesthat are linked in the Markov network, while Kikuchi's more accurateapproximations, also called the “cluster variational method,” allow forclusters of a larger size.

The cluster variational method errs by missing contributions to theGibbs free energy that arise from high-order correlations between nodesthat cannot be described in terms of the chosen clusters orintersections of clusters. The larger the size of the small clusters,the smaller the error.

In principle, the cluster variational method allows for arbitraryclusters on arbitrary Markov networks. In practice, however, mostprevious computations involving minimization of a Kikuchi free energywere restricted to symmetrically placed clusters over a homogenousnetwork. Computations on inhomogeneous networks have been performed, butonly for the Bethe approximation, see Morita, Physica 98A (1979) 566.

By “homogenous” networks, we mean networks for which the nodes arearranged in some regular repeated pattern, and the interactions betweenthe nodes depends only on their position in that pattern. Many magneticspin systems are “homogenous” in this sense, but most of the other typesof systems that are modeled with Markov networks, such as the example ofa medical diagnosis system, are not homogenous.

Previous prior art Kikuchi free energy minimization calculations wererestricted to symmetrically placed clusters over a homogeneous networkbecause, before our invention, there was no practical way to quicklyminimize a Kikuchi free energy defined for arbitrary clusters over aninhomogeneous Markov network of arbitrary topology.

Using the Kikuchi approximate form for the total Gibbs free energy, wecan approximately calculate the marginal probabilities of the variablesat the nodes, and also the marginal probabilities of collections ofnodes. We extend this technique to develop a message-passing method thatconverges to solutions corresponding to a Kikuchi approximation of theGibbs free energy. A message-passage method, as an advantage over theprior art Kikuchi free energy calculations, is very much faster and morepractical for inhomogeneous networks.

Kikuchi Free Energy Calculation for Simple Example

We use FIGS. 4 a-c to illustrate our generalized message-based beliefpropagation method. This example network shows how we develop amessage-passing method that has a corresponding Kikuchi approximationresult as a fixed point of the messages. That is to say, ourmessage-passing method will converge to a solution that is equivalent tothe solution of the Kikuchi approximation.

FIG. 4 a is a graph of many nodes grouped into three clusters. Theclusters intersect each other. The nodes and their links are not shown,and indeed, the number of nodes per cluster can be quite large. In realworld physical systems, one usually chooses many more than threeclusters for a Kikuchi approximation. FIGS. 4 b-c depict two possibleinstantiations of the three clusters and their intersections.

As described above with respect to FIG. 3, we group 310 the nodes of thenetwork 300 into arbitrary intersecting clusters 311. A cluster is saidto be intersecting if a single node appears in more than one cluster.Because we do an arbitrary clustering, we can focus on regions of themodel that might be more significant to an exact solution. We requirethat every node is included in at least cluster. We also require thatevery link in the Markov network is completely included in at least onecluster. The first constraint ensures that our clusters represent andconsider all nodes of the model, and the second constraint ensures thatthe clusters will intersect.

FIG. 4 a shows abstractly three clusters of nodes (regions 1, 2, and 3),and their intersections (regions 12, 31, and 23), and intersections ofintersections (region 123).

We use the following terminology. The largest regions are referred to as“clusters.” The clusters, along with their intersections, and theintersections of those intersections, and so on, are referred tocollectively as “regions.” A “sub-region” of region r is a region thatconsists of a proper subset of the nodes in region r. Region r is a“super-region” of region s if and only if region s is a sub-region ofregion r.

The number of nodes in each cluster can vary, and be different from eachother. As stated before larger clusters will yield better results thansmall clusters albeit at a greater computational cost. Note that region12 is interpreted to include region 123, just as region 1 also includesregion 123.

In equation [12] below, we approximate the total Gibbs free energy forthe system 300 by the sum of the free energies of each of the clusters.Then, we include correction terms by subtracting the free energy ofintersection regions where we over-counted the free energy. We mayinclude additional correction terms for the over-counted intersectionsof the intersection regions, etc. In addition, we impose, with Lagrangemultipliers, the additional constraints that the probabilities of eachregion sum to one, and that the probabilities have the correctmarginalization relationships with respect to each other.

We express the Gibbs free energy in terms of the probabilities of eachof the regions. We let α_(i) be a particular assignment of node valueswithin the cluster region i. This is what we called s₁, s₂, . . . ,s_(N) in equation [3] above. Our notation for region intersections is toconcatenate the indices of the intersecting regions, so α_(ijk) is aparticular assignment of node values within the region of intersectionof regions i, j, and k.

The “region energy” E(α_(i)) includes those terms in the overall energythat can be attributed to nodes or links contained entirely in theregion i, when the nodes in the region are set to the particular valuesdenoted by α_(i). Note that if there are higher-order tensor “links” inthe definition of the system, the region energy will still contain thoselinks that are contained.

The “region probability” P(α_(i)) is the marginal probability that thenodes in cluster i take on the particular values denoted by α_(i). Inour notation, the normalization constraint for region i is Σ_(α) _(i)P(α_(i))=1. That is, the sum over the probabilities of all possiblevalue assignments to the nodes of region i is one. An analogous equationholds for all regions. These are the Lagrange multiplier constraints γin equation [12], below.

The second set of constraints for the Gibbs free energy in equation [12]are the marginalization constraints, associated with Lagrangemultipliers λ. The Lagrange multipliers impose consistency constraintsbetween the probabilities of the regions and their intersectingsub-regions. The probability of a region, marginalized over all thenodes in the region not in some intersection region, must equal themarginal probability of the nodes in that intersection region. In ournotation, Σ_(α) _(i\j) P(α_(i))=P(α_(ij)), where Σ_(α) _(i\j) means thesum over all nodes in region i that are not in region j.

For the three clusters of nodes shown in FIGS. 4 b-c, we express theGibbs free energy in the Kikuchi approximation as G_(1,2,3). We havethree sets of U−ST terms, first a sum over states α_(i) for the clusterregion free energies, second the terms involving α_(ij) where wesubtract over-counted intersection regions, and a third term involvingα₁₂₃ where we correct for over-subtracting that intersection region.After all these corrections are made, each subset of each region iscounted exactly once, as it should be. $\begin{matrix}\begin{matrix}{G_{1,2,3} = {{\left( {U - {S\quad T}} \right)\quad{terms}} + {\gamma\quad{constraints}} + {\lambda\quad{constraints}}}} \\{= {{\sum\limits_{\alpha_{1}}{{P\left( \alpha_{1} \right)}{E\left( \alpha_{1} \right)}}} + {\sum\limits_{\alpha_{2}}{{P\left( \alpha_{2} \right)}{E\left( \alpha_{2} \right)}}} + {\sum\limits_{\alpha_{3}}{{P\left( \alpha_{3} \right)}{E\left( \alpha_{3} \right)}}} +}} \\{{T{\sum\limits_{\alpha_{1}}{{P\left( \alpha_{1} \right)}{\ln\left( {P\left( \alpha_{1} \right)} \right)}}}} + {T{\sum\limits_{\alpha_{2}}{{P\left( \alpha_{2} \right)}{\ln\left( {P\left( \alpha_{2} \right)} \right)}}}} +} \\{{T{\sum\limits_{\alpha_{3}}{{P\left( \alpha_{3} \right)}{\ln\left( {P\left( \alpha_{3} \right)} \right)}}}} - {\sum\limits_{\alpha_{12}}{{P\left( \alpha_{12} \right)}{E\left( \alpha_{12} \right)}}} -} \\{{\sum\limits_{\alpha_{23}}{P\left( \alpha_{23} \right){E\left( \alpha_{23} \right)}}} - {\sum\limits_{\alpha_{31}}{P\left( \alpha_{31} \right){E\left( \alpha_{31} \right)}}} -} \\{{T{\sum\limits_{\alpha_{12}}{{P\left( \alpha_{12} \right)}\ln\left( {P\left( \alpha_{12} \right)} \right)}}} - {T{\sum\limits_{\alpha_{23}}{{P\left( \alpha_{23} \right)}\ln\left( {P\left( \alpha_{23} \right)} \right)}}} -} \\{{T{\sum\limits_{\alpha_{31}}{{P\left( \alpha_{31} \right)}\ln\left( {P\left( \alpha_{31} \right)} \right)}}} + {T{\sum\limits_{\alpha_{123}}{{P\left( \alpha_{123} \right)}{E\left( \alpha_{123} \right)}}}} +} \\{{T{\sum\limits_{\alpha_{123}}{{P\left( \alpha_{123} \right)}{\ln\left( {P\left( \alpha_{123} \right)} \right)}}}} + {\gamma_{1}\left\lbrack {1 - {\sum\limits_{\alpha_{1}}{P\left( \alpha_{1} \right)}}} \right\rbrack} +} \\{{\gamma_{2}\left\lbrack {1 - {\sum\limits_{\alpha_{2}}{P\left( \alpha_{2} \right)}}} \right\rbrack} + {\gamma_{3}\left\lbrack {1 - {\sum\limits_{\alpha_{3}}{P\left( \alpha_{3} \right)}}} \right\rbrack} +} \\{{\gamma_{12}\left\lbrack {1 - {\sum\limits_{\alpha_{12}}{P\left( \alpha_{12} \right)}}} \right\rbrack} + {\gamma_{23}\left\lbrack {1 - {\sum\limits_{\alpha_{23}}{P\left( \alpha_{23} \right)}}} \right\rbrack} +} \\{{\gamma_{31}\left\lbrack {1 - {\sum\limits_{\alpha_{31}}{P\left( \alpha_{31} \right)}}} \right\rbrack} + {\gamma_{123}\left\lbrack {1 - {\sum\limits_{\alpha_{123}}{P\left( \alpha_{123} \right)}}} \right\rbrack} +} \\{{\sum\limits_{\alpha_{12}}{{\lambda_{1{\backslash 12}}\left( \alpha_{12} \right)}\left\lbrack {{P\left( \alpha_{12} \right)} - {\sum\limits_{\alpha_{1{\backslash 12}}}{P\left( \alpha_{1} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{12}}{{\lambda_{2{\backslash 12}}\left( \alpha_{12} \right)}\left\lbrack {{P\left( \alpha_{12} \right)} - {\sum\limits_{\alpha_{2{\backslash 12}}}{P\left( \alpha_{2} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{23}}{{\lambda_{2{\backslash 23}}\left( \alpha_{23} \right)}\left\lbrack {{P\left( \alpha_{23} \right)} - {\sum\limits_{\alpha_{2{\backslash 23}}}{P\left( \alpha_{2} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{23}}{{\lambda_{3{\backslash 23}}\left( \alpha_{23} \right)}\left\lbrack {{P\left( \alpha_{23} \right)} - {\sum\limits_{\alpha_{3{\backslash 23}}}{P\left( \alpha_{3} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{31}}{{\lambda_{3{\backslash 31}}\left( \alpha_{31} \right)}\left\lbrack {{P\left( \alpha_{31} \right)} - {\sum\limits_{\alpha_{3{\backslash 31}}}{P\left( \alpha_{3} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{31}}{{\lambda_{1{\backslash 31}}\left( \alpha_{31} \right)}\left\lbrack {{P\left( \alpha_{31} \right)} - {\sum\limits_{\alpha_{1{\backslash 31}}}{P\left( \alpha_{1} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{123}}{{\lambda_{{\backslash 12}\backslash\quad 123}\left( \alpha_{123} \right)}\left\lbrack {{P\left( \alpha_{123} \right)} - {\sum\limits_{\alpha_{12\backslash\quad 123}}{P\left( \alpha_{12} \right)}}} \right\rbrack}} +} \\{{\sum\limits_{\alpha_{123}}{{\lambda_{23\backslash\quad 123}\left( \alpha_{123} \right)}\left\lbrack {{P\left( \alpha_{123} \right)} - {\sum\limits_{\alpha_{23\backslash\quad 123}}{P\left( \alpha_{23} \right)}}} \right\rbrack}} +} \\{\sum\limits_{\alpha_{123}}{{{\lambda_{31\backslash\quad 123}\left( \alpha_{123} \right)}\left\lbrack {{P\left( \alpha_{123} \right)} - {\sum\limits_{\alpha_{31\backslash\quad 123}}{P\left( \alpha_{31} \right)}}} \right\rbrack}.}}\end{matrix} & \lbrack 12\rbrack\end{matrix}$

One point about the marginalization constraints. One might wonder why wedo not also have terms like$\sum\limits_{\alpha_{123}}{{\lambda_{1\backslash\quad 123}\left( \alpha_{123} \right)}\left\lbrack {{P\left( \alpha_{123} \right)} - {\sum\limits_{\alpha_{1\backslash\quad 123}}{P\left( \alpha_{1} \right)}}} \right\rbrack}$which enforce that region 1 properly marginalizes down to region 123. Weomit these terms because they are redundant with the marginalizationequations already included in G_(1,2,3.)

Marginalization is a linear operator. Of the three possible constraintsrelating regions i, ij, and ijk, that region i marginalize down toregion ij, that ij marginalize down to ijk, and that region imarginalize down to ijk, we only need two, and the third is linearlydependent on the other two equations.

The Kikuchi approximation to the Gibbs free energy can be generalized tothe case of an arbitrary collection of regions in a relativelystraightforward way, but we need to establish some additional morenotation. Let r denote a region chosen from the set of all relevantregions R. Let S(r) denote the set of sub-regions of r and T(r) denotethe set of super-regions of r. Define a “direct sub-region” of r as asub-region that does not have a super-region that is also a sub-regionof r. Thus, in our three cluster example, regions 12 and 13 are directsub-regions of region 1, but region 123 is not. Let S_(d)(r) denote theset of direct sub-regions of r and T_(d)(r) denote the set of directsuper-regions of r. Associate with each region r a “over-countingnumber” d_(r), defined recursively by $\begin{matrix}{d_{r} = {1 - {\sum\limits_{t \in {T{(r)}}}{d_{t}.}}}} & \lbrack 13\rbrack\end{matrix}$

The largest regions with no super-regions are defined to have adouble-counting number of one. In the three-cluster case, we haved₁=d₂=d₃=1, d₁₂=d₂₃=d₃₁=−1, and d₁₂₃=1. The over-counting number isneeded to insure that each link is ultimately counted exactly once.

The generalization of equation [12] for the Kikuchi approximation to theGibbs free energy for the case of an arbitrary collection of regions is:$\begin{matrix}\begin{matrix}{G = {\sum\limits_{r \in R}\left\lbrack {{d_{r}\left( {U_{r} - {T\quad S_{r}}} \right)} + {\gamma_{r}\left\lbrack {1 - {\sum\limits_{\alpha_{r}}{P\left( \alpha_{r} \right)}}} \right\rbrack} +} \right.}} \\{\left. {\sum\limits_{s \in {S_{d}{(r)}}}{\sum\limits_{\alpha_{s}}{{\lambda_{r\backslash s}\left( \alpha_{s} \right)}\left\lbrack {{P\left( \alpha_{s} \right)} - {\sum\limits_{\alpha_{r\backslash s}}{P\left( \alpha_{r} \right)}}} \right\rbrack}}} \right\rbrack,}\end{matrix} & \lbrack 14\rbrack\end{matrix}$where U_(r) and S_(r) are the internal energy and entropy of the regionr.

We want to find the cluster region probabilities that minimize theKikuchi free energy while obeying the marginalization and normalizationconstraints. These probabilities are the correct marginal probabilitiesunder the Kikuchi approximation.

To find equations for those probabilities, we differentiate equation[14] with respect to each of the region probabilities for each possibleassignment of values to a region and set the result to zero. Theresulting equations, illustrated for the three-cluster example of FIG. 4a, are:E(α₁)+T+T ln P(α₁)+γ₁=λ_(1\12)(α₁₂+λ_(1\31)(α₃₁)  [15]E(α₂)+T+T ln P(α₂)+γ₂=λ_(2\12)(α₁₂)+λ_(2\23)(α₂₃)  [16]E(α₃)+T+T ln P(α₃)+γ₃=λ_(3\23)(α₂₃)+λ_(2\31)(α₃₁)  [17]−E(α₁₂)−T−T lnP(α₁₂)+γ₁₂=−λ_(1\12)(α₁₂)−λ_(2\12)(α₁₂)+λ_(12\123)(α₁₂₃)  [18]−E(α₂₃)−T−T lnP(α₂₃)+γ₂₃=−λ_(2\23)(α₂₃)−λ_(3\23)(α₂₃)+λ_(23\123)(α₁₂₃)  [19]−E(α₃₁)−T−T lnP(α₃₁)+γ₃₁=−λ_(1\31)(α₃₁)−λ_(3\31)(α₃₁)+λ_(31\123)(α₁₂₃)  [20]E(α₁₂₃)+T+T lnP(α₁₂₃)+γ₁₂₃=−λ_(12\123)(α₁₂₃)−λ_(12\123)(α₁₂₃)−λ_(31\123)(α₁₂₃)  [21]Exponentiating both sides and rearranging terms of each equation, wehave the following set of equations for the marginal posteriorprobabilities: $\begin{matrix}{{P\left( \alpha_{1} \right)} = {k_{1}{\exp\left( \frac{- {E\left( \alpha_{1} \right)}}{T} \right)}{\exp\left( \frac{{\lambda_{1/12}\left( \alpha_{12} \right)} + {\lambda_{1{\backslash 31}}\left( \alpha_{31} \right)}}{T} \right)}}} & \lbrack 22\rbrack \\{{P\left( \alpha_{2} \right)} = {k_{2}{\exp\left( \frac{- {E\left( \alpha_{2} \right)}}{T} \right)}{\exp\left( \frac{{\lambda_{2/12}\left( \alpha_{12} \right)} + {\lambda_{2{\backslash 23}}\left( \alpha_{23} \right)}}{T} \right)}}} & \lbrack 23\rbrack \\{{P\left( \alpha_{3} \right)} = {k_{3}{\exp\left( \frac{- {E\left( \alpha_{3} \right)}}{T} \right)}{\exp\left( \frac{{\lambda_{3/23}\left( \alpha_{23} \right)} + {\lambda_{3{\backslash 31}}\left( \alpha_{31} \right)}}{T} \right)}}} & \lbrack 24\rbrack \\{{P\left( \alpha_{12} \right)} = {k_{12}{\exp\left( {- \frac{E\left( \alpha_{12} \right)}{T}} \right)}{\exp\left( \frac{\begin{matrix}{{\lambda_{1/12}\left( \alpha_{12} \right)} + {\lambda_{2{\backslash 12}}\left( \alpha_{12} \right)} -} \\{\lambda_{12\backslash\quad 123}\left( \alpha_{123} \right)}\end{matrix}}{T} \right)}}} & \lbrack 25\rbrack \\{{P\left( \alpha_{23} \right)} = {k_{23}{\exp\left( {- \frac{E\left( \alpha_{23} \right)}{T}} \right)}{\exp\left( \frac{\begin{matrix}{{\lambda_{2/23}\left( \alpha_{23} \right)} + {\lambda_{3{\backslash 23}}\left( \alpha_{23} \right)} -} \\{\lambda_{23\backslash\quad 123}\left( \alpha_{123} \right)}\end{matrix}}{T} \right)}}} & \lbrack 26\rbrack \\{{P\left( \alpha_{31} \right)} = {k_{31}{\exp\left( {- \frac{E\left( \alpha_{31} \right)}{T}} \right)}{\exp\left( \frac{\begin{matrix}{{\lambda_{3/31}\left( \alpha_{31} \right)} + {\lambda_{1{\backslash 31}}\left( \alpha_{31} \right)} -} \\{\lambda_{31\backslash\quad 123}\left( \alpha_{123} \right)}\end{matrix}}{T} \right)}}} & \lbrack 27\rbrack \\{{{P\left( \alpha_{123} \right)} = {k_{123}{\exp\left( {- \frac{E\left( \alpha_{123} \right)}{T}} \right)}{\exp\left( \frac{\begin{matrix}{{- {\lambda_{12/123}\left( \alpha_{123} \right)}} -} \\{{\lambda_{23\backslash\quad 123}\left( \alpha_{123} \right)} -} \\{\lambda_{31\backslash\quad 123}\left( \alpha_{123} \right)}\end{matrix}}{T} \right)}}},} & \lbrack 28\rbrack\end{matrix}$where the k's are normalization constants, different for each equation,that normalize each marginal probability. Each of these equations isactually a collection of equations. Thus, if we focus on equation [22],we get one equation for each instantiation of α₁. On the right hand sideof this equation, α₁₂ and α₃₁ are completely determined by α₁, becauseregions 12 and 31 are sub-regions of region 1.

Of course, we also need to differentiate the Gibb's free energy withrespect to the λ and γ Lagrange multipliers. Doing that gives us thedesired marginalization and normalization conditions for the regionprobabilities.

From Lagrange Multipliers to Messages

In the previous section, we derived a set of self-consistent equationsfor region probabilities and Lagrange multipliers by differentiating theKikuchi approximation of the Gibbs free energy. These equations are asolution to the problem of finding marginal posterior probabilities atthe level of the Kikuchi approximation.

In our new method, we develop equivalent and more convenientformulations of the same solution, in terms of a convergent“message-passing” method. By defining 320 the Lagrange multipliers, seestep 320, in terms of another set of quantities, called “messages,” weobtain a different but equivalent set of equations for the clusterprobabilities.

Our method works by treating the self-consistent equations as messageupdate rules that are iterated until a termination condition is reached,for example, a fixed number of iterations, or the solution converges330, at which point the marginal posterior probabilities 332 can bedetermined from the converged messages 340. Before the first update, weassign initial values 331 to the messages 321. The initial values can berandom or special values, e.g., all random positive numbers, or allones.

In general, there are many equivalent sets of message definitions andmessage-update rules that correspond to the same Kikuchi approximation.Some particular message definitions and message-update rules, which wedescribe later, have particularly nice properties and reduce to thealready known ordinary message-passing methods when applied to the Betheapproximation. In this section, we want to describe the generalproperties that should be satisfied by messages so that any particularmessage-passing method is equivalent to the corresponding Kikuchiapproximation.

We have used the marginalization relations to relate regionprobabilities to probabilities of their intersection regions. For thethree-cluster example, equations [22]-[28] describe how theprobabilities are related to the region energies, and the normalizationand marginalization Lagrange multipliers. We ignore the normalizationLagrange multipliers (the γ's) for now, knowing that these simply give amultiplicative scalar which normalizes the final probabilities.

Our goal is to express messages in terms of the λ Lagrange multipliers.It will turn out that the marginalization constraint relations will thengive us a set of fixed point equations for the messages, which we willinterpret as the message update rules 321.

We stack all of the λ Lagrange multipliers to create a vector {rightarrow over (λ)}. In this vector, there is one λ Lagrange multiplier foreach linearly independent marginalization constraint, times the numberof different possible variable assignments in the cluster region beingmarginalized down. For our three cluster example, the vector {rightarrow over (μ)} includes nine sets of Lagrange multipliers:λ_(1\12)(α₁₂), λ_(1\31)(α₃₁), λ_(2/12)(α₁₂), λ_(2/23)(α₂₃),λ_(3/23)(α₂₃), λ_(3/31)(α₃₁), λ_(12/123)(α₁₂₃), λ_(23/23)(α₁₂₃), andλ_(31/123)(α₁₂₃), which we take to be stacked in that order. The totaldimensionality of the vector {right arrow over (λ)} isD_(λ)=2D₁₂+2D₂₃+2D₃₁+3 D₁₂₃, where, for example, D₁₂ is thedimensionality of α₁₂.

In our example, equations [22-28] give the cluster probabilities interms of the Lagrange multipliers {right arrow over (λ)}. We take thelogarithm of all these equations, and define the local quantities

 L(α)=E(α)+T ln P(α)  [29]

for each cluster. Then we summarize these equations by:{right arrow over (L)}=Aλ,  [30]where {right arrow over (L)} is a vector of the L's in the seven regionstaken in the order 1, 2, 3, 12, 23, 31, and 123, and {right arrow over(L)} has dimensionality:D _(L) ≡D ₁ +D ₂ +D ₃ +D ₁₂ +D ₂₃ +D ₃₁ +D ₁₂₃.A is a matrix with D_(L) rows and D_(λ) columns, but it has a specialblock form, which we express as: $\begin{matrix}{A = {\begin{pmatrix}\overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & {- \overset{\sim}{1}} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & {- \overset{\sim}{1}} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & {- \overset{\sim}{1}} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & {- \overset{\sim}{1}} & {- \overset{\sim}{1}} & {- \overset{\sim}{1}}\end{pmatrix}.}} & \lbrack 31\rbrack\end{matrix}$

Each entry in this matrix actually denotes a sub-matrix. For example,the {tilde over (1)} in the upper left corner denotes a sub-matrix of D₁rows and D₁₂ columns which is equal to one when the nodes in α₁₂ agreewith the corresponding nodes in α₁, and is equal to zero otherwise. Thisbecomes clear in a more specific example.

If region 1 includes two nodes S_(a) and S_(b), and region 2 includesthe single node S_(b), and each node can be in two states that we label“1” and “2,” then α₁ can be in any of the four states S_(a)=1, S_(b)=1;S_(a)=1, S_(b)=2; S_(a)=2, S_(b)=1; S_(a)=2, S_(b)=2 and α₁₂ can be anyof the two states S_(b)=1; S_(b)=2, and the sub-matrix {tilde over (1)}in the top left corner of A is expressed as: $\begin{pmatrix}1 & 0 \\0 & 1 \\1 & 0 \\0 & 1\end{pmatrix}.$

In general, the self-consistent equations derived from a Kikuchiapproximation can be summarized in the form of a matrix equation {rightarrow over (L)}=A{right arrow over (λ)} and a set of normalization andmarginalization conditions.

We desire an alternative formulation for our solution in terms of“messages” and self-consistent message equations. We still use the samenormalization and marginalization conditions on the clusterprobabilities, but we redefine the Lagrange multipliers A in terms ofmessages.

We store the logarithms of all the messages between the regions in anordered vector {right arrow over (m)}. To be commensurate with theLagrange multipliers that we seek to summarize, we normally let thedimensionality of {right arrow over (m)} be the same as thedimensionality of {right arrow over (λ)}. In general, the vector {rightarrow over (m)} can have different dimensionality than {right arrow over(λ)}, but the vector needs to span the same sub-space of constraints onthe lattice so that we can re-write the Lagrange multiplier {right arrowover (λ)} in terms of messages {right arrow over (m)}.

Thus, for our example, we generate sets of logarithms of messagescommensurate with {right arrow over (λ)} as m_(1\12)(α₁₂),m_(1/31)(α₃₁), m_(2\12)(α₁₂), m_(2\23)(α₂₃), m_(3\23)(α₂3),m_(3\31)(α₃₁), M_(12\123)(α₁₂₃), M_(23\123)(α₁₂₃), and m_(31\123)(α₁₂₃).The vector m stores these messages in that order.

There is a set of self-consistent equations for the clusterprobabilities in terms of the messages where the equations have thematrix form:L=B{right arrow over (m)},  [32]or equivalently:A{right arrow over (λ)}=B{right arrow over (m)},  [33]where B is a “selection” matrix that determines the choice of themessage-passing method. This matrix is specified below.

The matrix B and the vector {right arrow over (m)} constitute a validreformulation of the Kikuchi solution as long as the system described byequation [33] has at least one solution that can be written in the formof a matrix equation:{right arrow over (m)}=C{right arrow over (λ)}.  [34]

Any valid reformulation of the Kikuchi solution can be used as the basisfor our message-passing method.

To illustrate this idea, we return to our three-cluster example of FIG.4 a, for which we have already chosen the vector {right arrow over (m)}.We can now choose many different matrices B, but we use one that arisesfrom a scheme that we describe in more detail below.

Recall that the m's defined above are the logarithms of messages, whichwe denote by M's. We now set the marginal posterior probabilities of theregions to obey the set of equations: $\begin{matrix}{\quad{{P\left( \alpha_{1} \right)} = {k_{1}{\exp\left( {- \frac{E\left( \alpha_{1} \right)}{T}} \right)}{M_{3{\backslash 31}}\left( \alpha_{31} \right)}{M_{2{\backslash 12}}\left( \alpha_{12} \right)}{M_{23\backslash\quad 123}\left( \alpha_{123} \right)}}}} & \lbrack 35\rbrack \\{\quad{{P\left( \alpha_{2} \right)} = {k_{2}{\exp\left( {- \frac{E\left( \alpha_{2} \right)}{T}} \right)}{M_{1{\backslash 12}}\left( \alpha_{12} \right)}{M_{3{\backslash 23}}\left( \alpha_{23} \right)}{M_{31\backslash\quad 123}\left( \alpha_{123} \right)}}}} & \lbrack 36\rbrack \\{\quad{{P\left( \alpha_{3} \right)} = {k_{3}{\exp\left( {- \frac{E\left( \alpha_{3} \right)}{T}} \right)}{M_{1{\backslash 31}}\left( \alpha_{31} \right)}{M_{2{\backslash 23}}\left( \alpha_{23} \right)}{M_{23\backslash\quad 123}\left( \alpha_{123} \right)}}}} & \lbrack 37\rbrack \\{{P\left( \alpha_{12} \right)} = {k_{12}{\exp\left( {- \frac{E\left( \alpha_{12} \right)}{T}} \right)}{M_{1{\backslash 12}}\left( \alpha_{12} \right)}{M_{2{\backslash 12}}\left( \alpha_{12} \right)}{M_{23\backslash\quad 123}\left( \alpha_{123} \right)}{M_{31\backslash\quad 123}\left( \alpha_{123} \right)}}} & \lbrack 38\rbrack \\{{P\left( \alpha_{23} \right)} = {k_{23}{\exp\left( {- \frac{E\left( \alpha_{23} \right)}{T}} \right)}{M_{3{\backslash 23}}\left( \alpha_{23} \right)}{M_{3{\backslash 23}}\left( \alpha_{23} \right)}{M_{12\backslash\quad 123}\left( \alpha_{123} \right)}{M_{31\backslash\quad 123}\left( \alpha_{123} \right)}}} & \lbrack 39\rbrack \\{{P\left( \alpha_{31} \right)} = {k_{31}{\exp\left( {- \frac{E\left( \alpha_{31} \right)}{T}} \right)}{M_{1{\backslash 31}}\left( \alpha_{31} \right)}{M_{3{\backslash 31}}\left( \alpha_{31} \right)}{M_{123\backslash\quad 123}\left( \alpha_{123} \right)}{M_{23\backslash\quad 123}\left( \alpha_{123} \right)}}} & \lbrack 40\rbrack \\{{{p\left( \alpha_{123} \right)} = {k_{123}{\exp\left( {- \frac{E\left( \alpha_{123} \right)}{T}} \right)}{M_{12\backslash\quad 123}\left( \alpha_{23} \right)}{M_{23\backslash\quad 123}\left( \alpha_{123} \right)}{M_{31\backslash\quad 123}\left( \alpha_{123} \right)}}},} & \lbrack 41\rbrack\end{matrix}$where again the k's are normalization constants that vary for eachequation, as for equations [22-28] above.

Taking the logarithm of these equations, and recalling the order of them's given above, we see that equations [35-41] are equivalent to thechoice $\begin{matrix}{B = {\begin{pmatrix}\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} \\\overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} \\\overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} \\\overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1}\end{pmatrix}.}} & \lbrack 42\rbrack\end{matrix}$

The choice of B and {right arrow over (m)} give an equivalentreformulation of the region probabilities when there is a solution tothe equation:{right arrow over (m)}=C{right arrow over (λ)}  [43]whereC=B ⁺ A,  [44]and B⁺ is the pseudo-inverse of B. Of course, we can always compute a Cmatrix as above, but for the matrix to represent a true solution, wemust verify that matrix equationBC=A  [45]is satisfied.

If a solution to A{right arrow over (λ)}=B{right arrow over (m)} existsfor a particular choice of matrix B and message vectors {right arrowover (m)}, we discard the original Lagrange multipliers λ and workentirely with the messages. In this case, we guarantee that theself-consistent equations for the messages will ultimately give the samecluster probabilities as those given by the Kikuchi approximation.

The next step is to determine the self-consistent equations for themessages. To do this, we need to combine the equations for the clusterprobabilities in terms of messages with the marginalization conditionson the cluster probabilities. In terms of our example, we use, forinstance, the equation P(α₁₂)=Σ_(α) _(1\12) P(α₁) along with theequations for P(α₁) and P(α₁₂) in terms of messages to obtain$\begin{matrix}{{{M_{1{\backslash 12}}M_{3\backslash\quad 123}} = {k{\sum\limits_{\alpha_{1{\backslash 12}}}{{\exp\left( \frac{{E\left( \alpha_{12} \right)} - {E\left( \alpha_{1} \right)}}{T} \right)}M_{3{\backslash 31}}}}}},} & \lbrack 46\rbrack\end{matrix}$where k is again a normalization constant. In all, there will be onesuch equation for each marginalization constraint.

We use the D_(λ) equations of the form of equation [46] in ourmessage-passing method, by interpreting them as message-update rules. Inthis interpretation, any messages on the right-hand-side of an equationare treated as the messages at iteration t, while the messages on theleft-hand-side of the equation are the desired messages at iterationt+1. We start, for example, with random messages 331, and at eachiteration replace the “old” messages on the right hand side of theequation with “new” messages from the left-hand side. In practice, themessages usually converge to a fixed point. When they converge, thefixed point corresponds to the solution in which we are interested.

As for all methods of this type, we can update the messagessynchronously or asynchronously, and we can update the messages withsome linear combination of the “old” and “new” messages. Our method isan iterative method for the solution of nonlinear equations, and assuch, its convergence properties may be improved by using standardprior-art techniques for such methods, see “Iterative Methods for Linearand Nonlinear Equations”, C. T. Kelley, 1995.

Regarding the normalization constants k, we could continue to set thesenormalization constants by requiring that the total clusterprobabilities sum to one. However, another procedure that works equallywell, and is easy to implement, temporarily scales the messages byrequiring that the sum of all the components of a message to equal agiven arbitrary constant while iterating the self-consistent equations.Reasonable choices for the constant are one, or the dimensionality ofthe message. Thus, for instance, we can require that: $\begin{matrix}{{\sum\limits_{\alpha_{12}}{M_{1{\backslash 12}}\left( \alpha_{12} \right)}} = 1.} & \lbrack 47\rbrack\end{matrix}$

After iterating the message equations to convergence, we re-scale themessages to guarantee that the region probabilities are properlynormalized.

We have described the messages as being associated with a region r andone of its direct sub-regions s. We can equivalently describe themessage as going from a set of “source” nodes to a set of “destination”nodes. The “source” nodes are those nodes in region r that are not indirect sub-region s, while the “destination” nodes are those nodes in s.

To obtain probability functions defined on a subset of the nodes in aregion, one simply marginalizes the region probability by summing itover those nodes that are in the region but not in the subset ofinterest.

We can also determine the maximum a posteriori probability values of thenodes in a region rather than the region probabilities. For theprior-art “standard” belief propagation method, a simple modificationsuffices to obtain the approximate MAP values of the nodes. Thenecessary modification is to replace all sums over node values in themessage-update rules with the value of the maximum, i.e., the “argmax,”over the same node values. The same modification can also be used toobtain the MAP values of regions for our method. Alternatively, we canuse the physics formulation of the Markov network, and set thetemperature T to a very small positive constant. In the limit T→0, werecover region probabilities that are entirely dominated by the MAPvalues of the nodes.

The message-update equations for some arbitrary B matrix is not, ingeneral, convenient to solve. With an arbitrary B matrix, one might findthat a message between two nodes on one side of the network would dependon messages between other nodes that are very far away. This is anundesirable property, both because it seems unintuitive, and because itmeans that solving the message-update equations will involve matrixmultiplications with very large matrices. Finding the free energy forarbitrary B matrices also involve large matrix multiplications orinversions.

In order that iterating the message update equations be an efficientoperation, we desire that the matrix B is sparse and local, i.e., only afew non-zero elements for each row or column, and only nearby messagesinfluence each other.

In the next section, we describe a “canonical” message-passing methodthat yields these desired properties.

A Canonical Method

In the previous section, we described the properties that must besatisfied by messages and their associated message-update rules, whichwe call the “message-passing method,” in order that the messages giveresults equivalent to the Kikuchi approximation. In general, there maybe many different message-passing methods that are all equivalent to thesame Kikuchi approximation. In this section, we describe how to selectone particular message-passing method that we call the “canonical”method.

Our canonical method gives the same message-update rules as the standardloopy message-passing method when applied at the level of the Betheapproximation. More generally, and as an advantage, our canonical methodleads to local message-update rules that are relatively easy andconvenient to solve because they correspond to a sparse B matrix.

We start with a Markov network with local compatibility functions φ_(ij)and evidence functions ψ_(i). The overall probability of a configurationof states s_(a), s_(b), . . . is expressed as. $\begin{matrix}{{{P\left( {s_{a},s_{b},\ldots} \right)} = {\frac{1}{Z}{\prod\limits_{i,j}{{\phi_{ij}\left( {s_{i},s_{j}} \right)}{\prod\limits_{i}{\psi_{i}\left( s_{i} \right)}}}}}},} & \lbrack 48\rbrack\end{matrix}$where the first product runs over all linked neighbors, i and j.

Using FIG. 5, we illustrate this procedure with a specific example. Inthis example, nine nodes labeled a through i, are arranged in a 3×3lattice 500. The only statistical dependencies or linksφ_(ij)(s_(i),s_(j)) included are the ones connecting nearest neighborsin the lattice.

General Procedure

The detailed steps of a general method is described with respect to FIG.6. In step 601, a network 600 of nodes and links is grouped 601 intoclusters or “super-nodes” 650 to be used in the approximation. Theseclusters can have a simple geometric shape, repeated over the lattice.Alternatively, the clusters can be triangles or squares of an irregularnetwork, or the clusters can be some other arbitrary choice of nodeclusters best suited for the approximation. The set of clusters mustcover all the links in the Markov network, and the clusters must besmall enough so that we are able to perform the generalized beliefpropagation operations described below.

In our example, we group the network 500 into clusters: (abde), (bcef),(degh), and (efhi) 501-504 of FIG. 5. This set of clusters “covers” thelattice 500. That is, every link between nodes which influence eachother is completely included in at least one of the four clusters wehave selected. Of course, we can select a different collection ofclusters which completely covers the lattice. That would just be adifferent Kikuchi approximation.

As stated before, the size of the clusters is proportional to theaccuracy of the approximation. If we chose a cluster (abcdefghi), i.e.,all the nodes in the system, then our approximation is exact. While thatgrouping may be feasible with such a small example, in general, we willperform summations whose number grows exponentially with the size of thecluster. Therefore, in a practical application, the number of nodes in acluster is limited.

The prior art Bethe approximation is obtained when we select as theclusters all the pairs of nodes that influence each other. In ourexample, that would mean partitioning the nodes pairwise into clusters(ab), (bc), (de), (ef), (gh), (hi), (ad), (be), (cf), (dg), (eh), and(fi). Because our four clusters are larger than the twelve clusters ofthe Bethe approximation, we obtain more accurate results.

In step 602, we identify a collection of relevant regions 651 byrecursively generating intersections of clusters, for our example, theregions, (be), (de) (fe), (he), and (e). We discard duplicate regions.

In step 603, we arrange the regions into a top-to-bottom hierarchy 652of their intersections. The hierarchy for our specific example is shownin more detail in FIG. 7. This hierarchy has the property that eachregion is connected only to its “direct sub-regions” and “directsuper-regions.” A sub-region s is a “direct sub-region” of anotherregion r if there are no other regions in our collection that aresuper-regions of s and sub-regions of r. In our example, regions (be)and (de) are both direct sub-regions of region (abde), but region (e) isnot. Region r is a direct super-region of region s if and only if regions is a direct sub-region of region r.

In step 604, we identify messages M 653 which connect all regions withtheir direct sub-regions. In our example, we would identify messagesM_(abde\be), M_(abde\de), M_(bcef\be), M_(bcef\fe), M_(degh\de),M_(degh he), M_(efhi\fe), M_(efhi\he), M_(be\e), M_(de\e), M_(fe\e), andM_(he\e).

We can identify these messages using a more compact notation that weillustrate with one example:M _(be) ^(ad) ≡M _(abde\be).  [47]

For this message, we send a message from the upper nodes a and d to thelower nodes b and e. We can consider the upper nodes to be the “source”nodes for the message, and the lower nodes to be the “destination”nodes. Each of the messages between a region and its direct sub-regionhas dimensionality corresponding to the dimensionality of thesub-region. Thus, M_(be) ^(ad) has the same dimensionality as region(be). When we want to emphasize that fact, we express the message asM_(be) ^(ad)(S_(b), s_(e)).

In step 605, we construct a probability function P(α_(R)) and a localpotential Φ(α_(R)) for each region R. We construct a vector {right arrowover (L)} out of the quantities L(α)≡ln P(α)−ln Φ(α). For our example,we need to construct probabilities P(s_(a),s_(b),s_(d),s_(e)),P(s_(b),s_(c),s_(e),s_(f)), P(s_(d),s_(e),s_(g),s_(h)),P(s_(e),s_(f),s_(h),s_(i)), P(s_(b),s_(e)), P(s_(d),s_(e)),P(s_(f),s_(e)), P(s_(h),s_(e)) and P(s_(e)). The local potentialsΦ(α_(R)) are constructed by combining all the original potentials φ_(ij)and ψ_(i) that belong to the region R. Thus, for example,Φ(s _(a) ,s _(b) ,s _(d) ,s _(e))=φ_(ab)(s _(a) ,s _(b))φ_(ad)(s _(a) ,s_(d)) φ_(be)(s _(b) ,s _(e))φ_(de)(s _(d) ,s _(e))ψ_(a)(s _(a))ψ_(b)(s_(b))ψ_(d)(s _(d))ψ_(e)(s _(e)),  [48]while Φ(s_(b),s_(e))=φ_(be)(s_(b),s_(e))ψ_(b)(s_(b))ψ_(e)(s_(e)) andΦ(s_(e))=ψ(s_(e))

Each probability function has the dimensionality of the region itdescribes. For example, if node b can be in any of three states and nodee can be in any of four states, then P(α_(be)) is described by 3*4=12numbers.

In step 606, we specify the matrix B 656 that determines the selectionof the message-passing method. This matrix will operate on the vector{right arrow over (m)} created from the logarithms of the messagesspecified in step 604. The matrix B has a block form that we explainedwhen describing equation [33]. This special block form “hides” thedetails of the different possible values of the nodes within eachmessage or region. Using this block form, the matrix B 655 has N_(R)rows and N_(M) columns, where N_(R) is the number of regions, and N_(M)is the number of messages. Each entry in the matrix is either a {tildeover (0)} matrix or a {tilde over (1)} matrix. We determine the entriesof the matrix B by the following sub-step 606 for each entry:

(a) Label the region corresponding to the row of the entry R_(R), andthe region and sub-region of the message corresponding to the column ofthe entry R_(C) and S_(C) respectively. By R_(C)\S_(C), we denote theset of nodes that are in R_(C) but not in S_(C).

(b) If the nodes in SC are the same as the nodes in R_(R), or if thenodes are a sub-set of the nodes in R_(R), and if all the nodes inR_(C)\S_(C) are not in R_(R), then the entry is a {tilde over (1)}matrix. Otherwise, the entry is a {tilde over (0)} matrix.

Explained more intuitively, this rule says that every region getsmessages that start at source nodes that are outside of the region andend at destination nodes that are inside of the region. In our example,assuming the regions are listed in the order (abde), (bcef), (degh),(efhi), (be), (de), (fe), (he), (e), and the messages are ordered as welisted them above, the matrix B is: $\begin{matrix}{B = {\begin{pmatrix}\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} \\\overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} \\\overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1} \\\overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{0} \\\overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{0} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1} & \overset{\sim}{1}\end{pmatrix}.}} & \lbrack 49\rbrack\end{matrix}$

We explain one of these entries: the {tilde over (1)} matrix in thefirst row, third column. The first row corresponds to the region (abde),while the third column corresponds to the message M_(be) ^(cf). Both thenodes c and f are outside of the region (abde) and both the nodes b ande are inside the region (abde), so the conditions for choosing a {tildeover (1)} matrix are met for this entry.

One important advantage of this rule is that when it is applied to themessages and regions obtained from the Bethe approximation, the rulewill always result in a matrix B that will ultimately yield the standard“loopy” belief propagation. Another important advantage is that thisrule will normally give back a matrix B that results in a system ofequations A{right arrow over (λ)}==B{right arrow over (m)} that issolvable. Another advantage is that this rule will result in a B matrixthat is sparse and local.

In step 607, we exponentiate the equations represented by the matrixequation {right arrow over (L)}=B{right arrow over (m)} to obtain theequations 656 relating the region probabilities to the messages. For ourexample, we obtain nine equations from the nine rows of the matrixequation. As an example, we show how to construct the equation for thefirst row. The first row of the B matrix corresponds to the region(abde). Reading across the first row, we see that there are {tilde over(1)} sub-matrices in the third, fifth, eleventh, and twelfth columns,which correspond to the messages M_(be) ^(cf), M_(de) ^(gh), M_(e) ^(f),and M_(e) ^(h) Therefore, the equation for the first row is:P(s _(a) , s _(b) ,s _(d) ,s _(e))=Φ(s _(a) ,s _(b) ,s _(d) ,s _(e))M_(be) ^(cf)(s_(b) ,s _(e))M _(de) ^(gh)(s _(d) ,s _(e))M _(e) ^(f)(s_(e))M _(e) ^(h)(s _(e)).  [50]The other equations are constructed in the same way. For example, theequation for the fifth row isP(i s_(b) ,s _(e))=Φ(i s_(b) ,s _(e))M _(be) ^(ad)(i s_(b) ,s _(e))M_(be) ^(cf)(s _(b) ,s _(e))M _(e) ^(d)(s _(e))M _(e) ^(f)(s _(e))M _(e)^(h)(s _(e))  [51]In step 608, we express the N_(m) constraint equations on the regionprobabilities, and substitute in equations relating the regionprobabilities to the messages in order to obtain a set ofself-consistent equations M_(c) (update rules) for messages 657.

In our example, we have twelve constraint equations. In the canonicalmethod, there are always exactly as many constraint equations as thereare messages. One of the constraint equations in our example relatesP(s_(a),s_(b),s_(d),s_(e)) to P(s_(b),s_(e)): $\begin{matrix}{{P\left( {s_{b},s_{e}} \right)} = {\sum\limits_{s_{b},s_{e}}{{P\left( {s_{a},s_{b},s_{d},s_{e}} \right)}.}}} & \lbrack 52\rbrack\end{matrix}$

Using the results from the previous step, we obtain the self-consistentequation for the messages from the marginalization (abde)→(be):$\begin{matrix}{{{M_{be}^{ad}\left( {s_{b},s_{e}} \right)}{M_{e}^{d}\left( s_{e} \right)}} = {\sum\limits_{s_{a}s_{d}}{{\phi_{ab}\left( {s_{a},s_{b}} \right)}{\phi_{ad}\left( {s_{a},s_{d}} \right)}{\phi_{be}\left( {s_{b},s_{e}} \right)}{\psi_{a}\left( s_{a} \right)}{\psi_{d}\left( s_{d} \right)}{{M_{de}^{gh}\left( {s_{d},s_{e}} \right)}.}}}} & \lbrack 53\rbrack\end{matrix}$

Eleven other similar equations can be obtained from the constraintequations. From the marginalization of region (abde)→(de) we obtain:$\begin{matrix}{{{M_{de}^{ad}\left( {s_{d},s_{e}} \right)}{M_{e}^{b}\left( s_{e} \right)}} = {\sum\limits_{s_{a}s_{b}}{{\phi_{ad}\left( {s_{a},s_{d}} \right)}{\phi_{ab}\left( {s_{a},s_{b}} \right)}{\phi_{de}\left( {s_{d},s_{e}} \right)}{\psi_{a}\left( s_{a} \right)}{\psi_{b}\left( s_{b} \right)}{M_{be}^{cf}\left( {s_{d},s_{e}} \right)}}}} & \lbrack 54\rbrack \\{{{From}\quad({bcef})}->{({be}):}} & \quad \\{{{M_{be}^{cf}\left( {s_{b},s_{e}} \right)}{M_{e}^{f}\left( s_{e} \right)}} = {\sum\limits_{s_{c}s_{f}}{{\phi_{bc}\left( {s_{b},s_{c}} \right)}{\phi_{cf}\left( {s_{c},s_{f}} \right)}{\phi_{be}\left( {s_{b},s_{e}} \right)}{\psi_{c}\left( s_{c} \right)}{\psi_{f}\left( s_{f} \right)}{M_{fe}^{ik}\left( {s_{f},s_{e}} \right)}}}} & \lbrack 55\rbrack \\{{{From}\quad({bcef})}->{({fe}):}} & \quad \\{{{M_{fe}^{cb}\left( {s_{f},s_{e}} \right)}{M_{e}^{b}\left( s_{e} \right)}} = {\sum\limits_{s_{c}s_{b}}{{\phi_{fc}\left( {s_{f},s_{c}} \right)}{\phi_{cb}\left( {s_{c},s_{b}} \right)}{\phi_{fe}\left( {s_{f},s_{e}} \right)}{\psi_{b}\left( s_{b} \right)}{\psi_{f}\left( s_{f} \right)}{M_{be}^{ad}\left( {s_{b},s_{e}} \right)}}}} & \lbrack 56\rbrack \\{{{From}\quad({degh})}->{({de}):}} & \quad \\{{{M_{de}^{gh}\left( {s_{d},s_{e}} \right)}{M_{e}^{h}\left( s_{e} \right)}} = {\sum\limits_{s_{g}s_{h}}{{\phi_{d\quad g}\left( {s_{d},s_{g}} \right)}{\phi_{gh}\left( {s_{g},s_{h}} \right)}{\phi_{hc}\left( {s_{h},s_{c}} \right)}{\psi_{g}\left( s_{g} \right)}{\psi_{h}\left( s_{h} \right)}{M_{he}^{if}\left( {s_{h},s_{c}} \right)}}}} & \lbrack 57\rbrack \\{{{From}\quad({degh})}->{({he}):}} & \quad \\{{{M_{he}^{gd}\left( {s_{h},s_{c}} \right)}{M_{c}^{d}\left( s_{e} \right)}} = {\sum\limits_{s_{g}s_{d}}{{\phi_{gh}\left( {s_{g},s_{h}} \right)}{\phi_{gd}\left( {s_{g},s_{d}} \right)}{\phi_{de}\left( {s_{d},s_{e}} \right)}{\psi_{d}\left( s_{d} \right)}{\psi_{g}\left( s_{g} \right)}{M_{d\quad c}^{ab}\left( {s_{d},s_{c}} \right)}}}} & \lbrack 58\rbrack \\{{{From}\quad({efhi})}->{({fe}):}} & \quad \\{{{M_{fe}^{ih}\left( {s_{f},s_{e}} \right)}{M_{e}^{h}\left( s_{c} \right)}} = {\sum\limits_{s_{i}s_{h}}{{\phi_{fi}\left( {s_{f},s_{i}} \right)}{\phi_{hi}\left( {s_{h},s_{i}} \right)}{\phi_{fe}\left( {s_{f},s_{e}} \right)}{\psi_{h}\left( s_{h} \right)}{\psi_{i}\left( s_{i} \right)}{M_{he}^{gd}\left( {s_{h},s_{e}} \right)}}}} & \lbrack 59\rbrack \\{{{From}\quad({efhi})}->{({he}):}} & \quad \\{{{M_{he}^{fi}\left( {s_{h},s_{e}} \right)}{M_{e}^{f}\left( s_{e} \right)}} = {\sum\limits_{s_{i}s_{j}}{{\phi_{hi}\left( {s_{h},s_{i}} \right)}{\phi_{fi}\left( {s_{f},s_{i}} \right)}{\phi_{fe}\left( {s_{f},s_{e}} \right)}{\psi_{f}\left( s_{f} \right)}{\psi_{i}\left( s_{i} \right)}{M_{fe}^{cb}\left( {s_{f},s_{e}} \right)}}}} & \lbrack 60\rbrack \\{{{From}\quad({be})}->{(e):}} & \quad \\{{M_{c}^{b}(e)} = {\sum\limits_{s_{b}}{{\phi_{be}\left( {s_{b},s_{e}} \right)}{\psi_{b}\left( s_{b} \right)}{M_{be}^{ad}\left( {s_{b},s_{e}} \right)}{M_{be}^{cf}\left( {s_{b},s_{e}} \right)}{M_{b}^{a}\left( s_{b} \right)}{M_{b}^{c}\left( s_{b} \right)}}}} & \lbrack 61\rbrack \\{{{From}\quad({de})}->{(e):}} & \quad \\{{M_{e}^{d}(e)} = {\sum\limits_{s_{d}}{{\phi_{de}\left( {s_{d},s_{e}} \right)}{\psi_{d}\left( s_{d} \right)}{M_{de}^{ab}\left( {s_{d},s_{e}} \right)}{M_{de}^{gf}\left( {s_{d},s_{e}} \right)}{M_{d}^{a}\left( s_{d} \right)}{M_{d}^{g}\left( s_{d} \right)}}}} & \lbrack 62\rbrack \\{{{From}\quad({fe})}->{(e):}} & \quad \\{{M_{e}^{f}(e)} = {\sum\limits_{s_{f}}{{\phi_{fe}\left( {s_{f},s_{e}} \right)}{\psi_{f}\left( s_{f} \right)}{M_{fe}^{cb}\left( {s_{f},s_{e}} \right)}{M_{fe}^{ih}\left( {s_{f},s_{e}} \right)}{M_{f}^{c}\left( s_{f} \right)}{M_{f}^{i}\left( s_{f} \right)}}}} & \lbrack 63\rbrack \\{{{From}\quad({he})}->{(e):}} & \quad \\{{M_{c}^{h}(e)} = {\sum\limits_{s_{h}}{{\phi_{he}\left( {s_{h},s_{e}} \right)}{\psi_{h}\left( s_{h} \right)}{M_{he}^{gd}\left( {s_{h},s_{e}} \right)}{M_{he}^{if}\left( {s_{h},s_{e}} \right)}{M_{h}^{g}\left( s_{h} \right)}{{M_{h}^{i}\left( s_{h} \right)}.}}}} & \lbrack 64\rbrack\end{matrix}$In step 609, we solve the self-consistent message equations 657 byiteratively updating the values according to the rules, beginning withsome initial message values 658 and some “evidence” ψ659. The initialvalues 658 can be random or special values, e.g., all random positivenumbers, or all ones. The “evidence” is given by the values of theVariables.

For this sub-step, there are no guarantees of convergence, though formany cases in practice, such systems of equations will quickly converge.One can continue updating the message-update equations until theysatisfy some termination condition, or alternatively stop after a givennumber of iterations.

For our example, in order to obtain some numerical results, we need tospecify the values of the variables defining the model. We choose thecase where all the nodes can be in one of two states, and where thelocal potentials φ_(ij)(s_(i),s_(j)) can be written as: $\begin{matrix}{{{\phi_{ij}\left( {s_{i},s_{j}} \right)} = \begin{pmatrix}{{\exp\left( {J_{ij}/T} \right)}{\exp\left( {{- J_{ij}}/T} \right)}} \\{{\exp\left( {{- J_{ij}}/T} \right)}{\exp\left( {J_{ij}/T} \right)}}\end{pmatrix}},} & \lbrack 65\rbrack\end{matrix}$where J_(ij) is a bond strength and T is the temperature parameter. Thisis known as the Ising model of statistical physics. We set all the bondstrengths between nearest neighbors J_(ij)=1 and the temperature T=2. Wealso set the ψ_(i)(s_(i)) “evidence” potentials to be uniform for allnodes i: $\begin{matrix}{{\psi_{i}\left( s_{i} \right)} = {\begin{pmatrix}1 \\1\end{pmatrix}.}} & \lbrack 66\rbrack\end{matrix}$

This corresponds to the Ising model in a zero field.

Convergent dynamics are obtained for this model by taking, at eachiteration, the new messages to be an average of the “old” messages andthe new computed messages. The messages from one node to one node aredetermined first, and then these results are used to determine the twonode to two node messages. In general, one can order the message-updateequations so that the messages that depend on the fewest nodes aresolved first, so that those results can be used on the left-hand-side ofthe equations for the later messages.

Next, we solve for the region marginal probabilities 660 using theconverged values of the messages and the equations 656 previouslyderived. We insure that the region probabilities are normalized so thatthey sum to one.

Interestingly, for our example, the final messages obtained from a “run”depends on the initial messages. However, the final converged regionprobabilities derived from the final messages always are the same. Thisreflects a general phenomena, which is the fact that when there are moremessages than regions, the messages form an over-complete representationof the region probabilities.

The numerical results obtained using our four cluster Kikuchiapproximation are excellent and much better than those obtained from theprior art pairwise Bethe approximation.

To take one example, the exactly computed value of the probabilityfunction P(s_(b),s_(e)) is, to six significant digits: $\begin{matrix}{{P\left( {s_{b},s_{e}} \right)} = {\begin{pmatrix}0.406496 & 0.093504 \\0.093504 & 0.406496\end{pmatrix}.}} & \lbrack 66\rbrack\end{matrix}$

Our approximation yields: $\begin{matrix}{{P\left( {s_{b},s_{e}} \right)} = {\begin{pmatrix}0.406659 & 0.093341 \\0.093341 & 0.406659\end{pmatrix}.}} & \lbrack 67\rbrack\end{matrix}$while the result from standard belief propagation, equivalent to usingthe Bethe approximation, is: $\begin{matrix}{{P\left( {s_{b},s_{e}} \right)} = {\begin{pmatrix}0.365529 & 0.134471 \\0.134471 & 0.365529\end{pmatrix}.}} & \lbrack 68\rbrack\end{matrix}$

Thus, our generalized belief propagation method has eliminated more than99.6 percent of the error associated with the prior art “standard”belief propagation in this case. This is a significant improvement overthe prior art.

Calculating other numerical quantities gives similar improvements forour example.

Determining the Kikuchi Free Energy

So far, we have described a process for generating message-passingmethods. We have not yet mentioned the Kikuchi free energy calculationsthat motivated our work. In fact, our method as described above haspractical use by itself. On the other hand, for some applications, ourmethod may be of interest to determined the Kikuchi free energy, as wellas other physical quantities, such as the internal energy, the entropy,the specific heat, magnetizations, susceptibilities, and possiblecritical temperatures that can be derived from the Kikuchi free energy.

The Kikuchi approximation to the Gibbs free energy includes thefollowing parts: the internal energy (enthalpy) U, entropic term −TS,and Lagrange multiplier terms constraining the normalization andmarginalizations of the region probabilities. The full formula for theKikuchi approximation to the Gibbs free energy is given in equation[14].

For the internal energy and entropy parts, one must remember to weighteach region's contribution by its over-counting number dr. After themessage-passing method has converged, the Lagrange multiplier terms allsum to zero and can be ignored, because the corresponding constraintsare satisfied.

Application of the Canonical Generalized Belief Propagation Method toSome Important Markov Networks

It may be helpful if we explicitly express the equations obtained usingthe described canonical method for some common Markov networks. Weconsider a square lattice with clusters consisting of all the four-nodesquares in the lattice, and a triangular lattice with clustersconsisting of all the three-node triangles in the lattice.

Square Lattice

In a square lattice of nodes each node is connected to its four nearestneighbors. FIG. 2(a) is an example of a square lattice. This is animportant Markov network for image processing or computer visionapplications. Using each possible four-node loop as a cluster, we derivea set of message-passing rules which improve the accuracy of thecomputed marginal probabilities over the Bethe approximation'sperformance. To obtain improved performance, we introduce double-indexedmessages, i.e., arrays of numbers, in addition to the single-indexedmessages, i.e., vectors that are part of the Bethe approximationmessage-passing.

Following the implementation rules described above, we derive themessage-passing rules, and marginal probabilities in terms of thosemessages. These, we list below. We list only the marginalization andpropagation equations for the regions away from the edges of the arrayof nodes.

The regions obtained from the intersections of the original squareclusters are regions consisting of two nodes connected by a link, andregions consisting of a single node.

The marginal probability of a region consisting of a single node isexpressed as:P(S _(a))=M _(a) ^(c) M _(a) ^(b) M _(a) ^(c) M _(a) ^(d)ψ_(a),  [73]where P(S_(a)) is the marginal probability for each of the differentpossible states S_(a), of the random variable at node a, and M^(e) _(a)is the converged value of the message from ource node e to destinationnode a, and similarly for the messages from the other nodes neighboringnode a. Each of those messages is a vector of the dimensionality of thenumber of different possible states at node a. To avoid cluttering theseformulae, we have not suppressed the functional dependence of themessages; to make it explicit, one would write for example M_(a)^(d)(S_(a)). FIG. 8(a) is a graphical representation of equation [73].Each arrow symbolizes a single-indexed message.

The marginal probability of a region consisting of two linked nodes isexpressed as:P(S _(a) ,S _(b))=M _(a) ^(e) M _(a) ^(c) M _(a) ^(d) M _(b) ^(f) M _(b)^(g) M _(b) ^(h) M _(ab) ^(ef) M _(ab) ^(ch)φ_(ab)ψ_(a)φ_(b),  [74]where the terms are defined analogously as with equation [73], and, forexample, M_(ab) ^(ch) means the double-indexed message from source nodesch to destination nodes ab. Here, the value φ_(ab), is the compatibilitymatrix between nodes a and b, and is a function of S_(a) and S_(b). FIG.8(b) is a graphical representation of Equation [74]. An arrow dragging aline symbolizes a double-indexed message, and a simple line connectingtwo nodes symbolizes a compatibility matrix.

The marginal probability of a square region is expressed as:P(S _(a) , S _(b) ,S _(f) ,S _(e))=M _(a) ^(c) M _(a) ^(d) M _(b) ^(g) M_(b) ^(h) M _(f) ^(k) M _(f) ^(l) M _(e) ^(i) M _(e) ^(j) M _(ab) ^(ch)M _(fb) ^(lg) M _(ef) ^(jk) M _(ea) ^(id) φ_(ab)φ_(ae)φ_(bf)φ_(ef)ψ_(a)ψ_(b)ψ_(f)ψ_(f)ψ_(e)  [75]with the notation defined as for the equations above. FIG. 8(c) is agraphical representation of Equation [75].

By marginalizing equation [74] over S_(b) and setting the result equalto equation [73], we obtain one of the two sets of self-consistentmessage equations that are satisfied by the converged values of themessages:M _(a) ^(b) =M _(b) ^(f) M _(b) ^(g) M _(b) ^(h) M _(ab) ^(ef) M _(ab)^(ch) φ _(ab)ψ_(b).  [76]

FIG. 9 is a graphical representation of equation [76].

By marginalizing equation [75] over S_(e) and S_(f), and setting theresult equal to equation [74], we obtain the other set ofself-consistent message equations: $\begin{matrix}{M_{ab}^{ef} = {\frac{M_{f}^{k}M_{f}^{l}M_{e}^{i}M_{e}^{j}M_{fb}^{\lg}M_{ef}^{jk}M_{ea}^{id}\varphi_{ae}\varphi_{ef}\varphi_{fb}\psi_{e}\psi_{f}}{M_{a}^{e}M_{b}^{f}}.}} & \lbrack 77\rbrack\end{matrix}$

For the values of M_(a) ^(e) and M_(b) ^(f) in equation [77], we use thevalues determined from equation [76]. FIG. 10 is a graphicalrepresentation of equation [77].

Triangular Lattice

A triangular lattice is another Markov network of particular interestfor modeling two dimensional datasets. FIG. 2(b) is an example of atriangular lattice. Analogously with the equations and notation asdescribed above, we described the following.

The marginal probability of a single-node region is expressed as:P(S _(a))=M _(a) ^(b) M _(a) ^(c) M _(a) ^(d) M _(a) ^(e) M _(a) ^(i) M_(a) ^(l) ψ _(a).  [78]

FIG. 11(a) is a graphical representation of equation [78].

The marginal probability of a region consisting of two linked nodes isexpressed as:P(S _(a) , S _(e))=M_(a) ^(b) M _(a) ^(c) M _(a) ^(d) M _(a) ^(i) M _(a)^(l) M _(e) ^(d) M _(e) ^(f) M _(e) ^(g) M _(e) ^(h) M _(e) ^(i) M _(ae)^(d) M _(ae) ^(i)φ_(ae)ψ_(a)ψ_(e).  [79]

FIG. 11(b) is a graphical representation of equation [79].

The marginal probability of a triangle region is expressed as:P(S_(a) , S _(e) , S _(i))=M_(a) ^(l) M _(a) ^(b) M _(a) ^(c) M _(a)^(d) M _(e) ^(d) M _(e) ^(f) M _(e) ^(g) M _(e) ^(h) M _(i) ^(h) M _(i)^(j) M _(i) ^(k) M _(i) ^(l) M _(ai) ^(l) M _(ae) ^(e) M _(ei)^(h)φ_(ae)φ_(ai)φ_(ei)ψ_(a)ψ_(e)ψ_(i).  [80]

FIG. 11(c) is a graphical representation of equation [80].

By marginalizing equation [79] over S_(e), and setting the result equalto [78], we obtain the self-consistent message equation:M _(a) ^(e) =M _(e) ^(d) M _(e) ^(f) M _(e) ^(g) M _(e) ^(h) M _(e) ^(i)M _(ae) ^(d) M _(ae) ^(i)φ_(ae)ψ_(e)  [81]

FIG. 12 is a graphical representation of equation [81].

By marginalizing equation [80] over S_(i), and setting the result equalto equation [79], we obtain the self-consistent message equation:$\begin{matrix}{M_{ae}^{i} = {\frac{M_{k}^{l}M_{i}^{k}M_{i}^{j}M_{i}^{h}M_{ai}^{l}M_{ei}^{h}\varphi_{ai}\varphi_{ei}\psi_{i}}{M_{a}^{i}M_{e}^{i}}.}} & \lbrack 82\rbrack\end{matrix}$

FIG. 13 is a graphical representation of equation [82]. As describedabove for the square lattice, for the values of M_(a) ^(i) and M_(e)^(i) in equation [82], we use the values determined from equation [81]during the same iteration.

Modified Loopy Super-Node Method

We now describe an alternative embodiment of our invention. We call thisembodiment the “modified loopy super-node method.” This method ispartially motivated by Pearl's “clustering method”, and the related“junction tree method” which are prior-art methods that has beenemployed to obtain exact answers for Markov networks with only a fewloops, see “Learning in Graphical Models”, edited by M. I. Jordan, 1998.

In the junction tree method, one generates a Markov network equivalentto the original one by grouping the original nodes into “super-nodes.”New φ compatibility matrices and ψ evidence functions are also generatedfor the super-nodes such that the probability of any state using thesuper-node formulation is equivalent to its probability in the originalMarkov network. If the super-node Markov network has no loops, one canapply standard belief-propagation to determine exactly the desiredmarginal probabilities, even if the original Markov network had someloops. Using the standard belief propagation method on a super-nodenetwork that was designed to have no loops is called the “junction treemethod.”

Of course, if the super-node network still has loops, one is still facedwith a difficult problem which standard belief propagation will notsolve exactly. Nevertheless, one might consider the obvious “naïve”method of using ordinary loopy belief propagation on the super-nodenetwork. This “naïve” method, which one might call the “loopy super-nodemethod,” sometimes gives acceptable results, but empirical testing showsthat it also can give very poor results for some simple networks.

In our alternative embodiment of our invention, we make a modificationto the loopy super-node method which guarantees that it converges. Ourmodified method also gives results that are equivalent to using aKikuchi approximation with clusters corresponding to the chosensuper-nodes. Thus, our “modified loopy super-node method” is analternative method that obtains the same results as our “canonical”method described above.

For concreteness, we illustrate the modified loopy super-node methodusing the same specific Markov network model and choice of clusters thatwe used to illustrate the “canonical” method, see FIG. 14. This networkconsists of nine nodes (α through i), and we choose four arbitraryclusters (abde), (bcef), (degh), and (efhi). These four clusters arealso referred to as “super-nodes.”

Constructing A Super-Node Network

We begin by reformulating the average energy U. The Kikuchiapproximation for the average energy is in fact exactly correct when weuse the correct marginal probabilities. For our example network, theaverage energy is expressed within the Kikuchi approximation as:$\begin{matrix}{U = {{\sum\limits_{\alpha_{abde}}\quad{{P\left( \alpha_{abde} \right)}{E\left( \alpha_{abde} \right)}}} + {\sum\limits_{\alpha_{bcef}}\quad{{P\left( \alpha_{bcef} \right)}{E\left( \alpha_{bcef} \right)}}} + {\sum\limits_{\alpha_{degh}}\quad{{P\left( \alpha_{degh} \right)}{E\left( \alpha_{degh} \right)}}} + {\sum\limits_{\alpha_{efhi}}\quad{{P\left( \alpha_{efhi} \right)}{E\left( \alpha_{efhi} \right)}}} - {\sum\limits_{\alpha_{be}}\quad{{P\left( \alpha_{be} \right)}{E\left( \alpha_{be} \right)}}} - {\sum\limits_{\alpha_{de}}\quad{{P\left( \alpha_{de} \right)}{E\left( \alpha_{de} \right)}}} - {\sum\limits_{\alpha_{je}}\quad{{P\left( \alpha_{je} \right)}{E\left( \alpha_{je} \right)}}} - {\sum\limits_{\alpha_{he}}\quad{{P\left( \alpha_{he} \right)}{E\left( \alpha_{he} \right)}}} + {\sum\limits_{\alpha_{e}}\quad{{P\left( \alpha_{e} \right)}{{E\left( \alpha_{e} \right)}.}}}}} & \lbrack 83\rbrack\end{matrix}$

This expression includes terms that sum over the intersection regions(be), (de), fe), (he), and (e). As for the canonical method above, wewant to eliminate these terms and have an expression that only containssums over the super-node regions. We can easily do that because each ofthe sums over the intersection regions can be expressed as a sum over asuper-node region. For example, $\begin{matrix}{{\sum\limits_{\alpha_{be}}\quad{{P\left( \alpha_{be} \right)}{E\left( \alpha_{be} \right)}}} = {\sum\limits_{\alpha_{abde}}\quad{{P\left( \alpha_{abde} \right)}{{E\left( \alpha_{be} \right)}.}}}} & \lbrack 84\rbrack\end{matrix}$

This equation is true because of the marginalization formulae forP(α_(be)): $\begin{matrix}{{\sum\limits_{\alpha_{abde}}\quad{{P\left( \alpha_{abde} \right)}{E\left( \alpha_{be} \right)}}} = {\sum\limits_{\alpha_{be}}{\sum\limits_{\alpha_{ad}}{{P\left( \alpha_{abde} \right)}{E\left( \alpha_{be} \right)}}}}} & {\lbrack 85\rbrack} \\{{= {\sum\limits_{\alpha_{be}}\quad{{E\left( \alpha_{be} \right)}{\sum\limits_{\alpha_{ad}}\quad{P\left( \alpha_{abde} \right)}}}}}\quad} & {\lbrack 86\rbrack} \\{= {\sum\limits_{\alpha_{be}}{{E\left( \alpha_{be} \right)}{{P\left( \alpha_{be} \right)}.}}}} & {\lbrack 87\rbrack}\end{matrix}$

Therefore, the Kikuchi average energy can be expressed as:$\begin{matrix}{U = {{\sum\limits_{\alpha_{abde}}\quad{{P\left( \alpha_{abde} \right)}{\overset{\sim}{E}\left( \alpha_{abde} \right)}}} + {\sum\limits_{\alpha_{bcef}}\quad{{P\left( \alpha_{bcef} \right)}{\overset{\sim}{E}\left( \alpha_{bcef} \right)}}} + {\sum\limits_{\alpha_{degh}}\quad{{P\left( \alpha_{degh} \right)}{\overset{\sim}{E}\left( \alpha_{degh} \right)}}} + {\sum\limits_{\alpha_{efhi}}\quad{{P\left( \alpha_{efhi} \right)}{{\overset{\sim}{E}\left( \alpha_{efhi} \right)}.}}}}} & \lbrack 88\rbrack\end{matrix}$

The {tilde over (E)} terms are not uniquely determined. They depend onhow we choose to distribute the intersection region terms. One choiceis: $\begin{matrix}{{\overset{\sim}{E}\left( \alpha_{abde} \right)} = {{E\left( \alpha_{abde} \right)} - {E\left( \alpha_{be} \right)} + {E\left( \alpha_{e} \right)}}} & \lbrack 89\rbrack \\{{\overset{\sim}{E}\left( \alpha_{bcef} \right)} = {{E\left( \alpha_{bcef} \right)} - {E\left( \alpha_{ef} \right)}}} & \lbrack 90\rbrack \\{{\overset{\sim}{E}\left( \alpha_{ghde} \right)} = {{E\left( \alpha_{ghde} \right)} - {E\left( \alpha_{de} \right)}}} & \lbrack 91\rbrack \\{{\overset{\sim}{E}\left( \alpha_{efhi} \right)} = {{E\left( \alpha_{efhi} \right)} - {{E\left( \alpha_{he} \right)}.}}} & \lbrack 92\rbrack\end{matrix}$

Our goal is to construct a Markov network between the super-nodes thatis equivalent to the original Markov network. Let us label the foursuper-nodes by the numbers 1, 2, 3, and 4, see FIG. 15 a. We constructthe evidence functions using the equation, for super-node 1.{tilde over (ψ)}₁(α₁)=e ^(−{tilde over (E)}(α) ¹ ^()/T),  [93]and similarly for the other super-nodes.

The compatibility matrices between the super-node regions are chosen sothat if two super-nodes are in consistent states, then the compatibilityfunction is one, otherwise the compatibility function is zero. Forexample, if super-node 1 is in a first “super-node” state such that nodeb is in some first “node” state, and super-node 2 is in a second“super-node” state such that node b is in a second “node” state, thanthe two super-nodes are in inconsistent “super-node” states. Thecompatibility matrix {tilde over (φ)}₁₂(α₁, α₂) should equal zero forthose particular states.

If we define the evidence vectors and compatibility matrices asdescribed, then the joint probability for super-node states is:$\begin{matrix}{{\overset{\sim}{P}\left( {\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4}} \right)} = {\frac{1}{Z}{\prod\limits_{n}\quad{{{\overset{\sim}{\psi}}_{n}\left( \alpha_{n} \right)}{\prod\limits_{mn}\quad{{\overset{\sim}{\varphi}}_{mn}\left( {\alpha_{m},\alpha_{n}} \right)}}}}}} & \lbrack 94\rbrack\end{matrix}$(n and m are super-node indices) will exactly agree with the jointprobability P defined for the original nodes.

We summarize our construction in FIG. 15 a. We define a new “super-node”graph 1500 that has four super-nodes 1501-1504: 1≡(abde), 2≡(bcef),3≡(degh), and 4≡(efhi). We denote the local “evidence potentials” forthe four super-nodes by {tilde over (ψ)}. We set the pairwiseconnections (links) {tilde over (φ)} 1511-1514 between the super-nodesso that super-nodes must agree on shared nodes. By construction, thesuper-node network 15 a has the same probability distribution as theoriginal network of FIG. 14.

In order that our super-node network is consistent with the Kikuchiapproximation to the Gibbs free energy, we also must enforce all theconstraints between the regions involved in the Kikuchi approximation.We aim for a minimal set of constraints (Lagrange multipliers). Notethat in the canonical method, we used an over-complete set ofconstraints. The following set of marginalization constraints aresufficient for our example network: $\begin{matrix}{{\sum\limits_{\alpha_{ad}}{P\left( \alpha_{abde} \right)}} = {{P\left( \alpha_{be} \right)} = {\sum\limits_{\alpha_{ef}}{P\left( \alpha_{bcef} \right)}}}} & \lbrack 95\rbrack \\{{\sum\limits_{\alpha_{bc}}{P\left( \alpha_{bcef} \right)}} = {{P\left( \alpha_{ef} \right)} = {\sum\limits_{\alpha_{hi}}{P\left( \alpha_{efhi} \right)}}}} & \lbrack 96\rbrack \\{{\sum\limits_{\alpha_{fi}}{P\left( \alpha_{efhi} \right)}} = {{P\left( \alpha_{eh} \right)} = {\sum\limits_{\alpha_{dg}}{P\left( \alpha_{degh} \right)}}}} & \lbrack 97\rbrack \\{{\sum\limits_{\alpha_{gh}}{P\left( \alpha_{degh} \right)}} = {{P\left( \alpha_{de} \right)} = {\sum\limits_{\alpha_{ab}}{P\left( \alpha_{abde} \right)}}}} & \lbrack 98\rbrack \\{{{\sum\limits_{\alpha_{be}}{P\left( \alpha_{be} \right)}} = {P\left( \alpha_{e} \right)}},} & \lbrack 99\rbrack\end{matrix}$in addition to the standard normalization constraints on the super-noderegions.

With these choices, the Kikuchi free energy for this example networktakes the following form: $\begin{matrix}{U = {{\sum\limits_{\alpha_{1}}\quad{{P\left( \alpha_{1} \right)}\overset{\sim}{E}\left( \alpha_{1} \right)}} + {\sum\limits_{\alpha_{2}}\quad{{P\left( \alpha_{2} \right)}\overset{\sim}{E}\left( \alpha_{2} \right)}} + {\sum\limits_{\alpha_{3}}\quad{{P\left( \alpha_{3} \right)}\overset{\sim}{E}\left( \alpha_{3} \right)}} + {\sum\limits_{\alpha_{4}}\quad{{P\left( \alpha_{4} \right)}\overset{\sim}{E}\left( \alpha_{4} \right)}} + {T\left\lbrack {{\sum\limits_{\alpha_{1}}\quad{{P\left( \alpha_{1} \right)}\ln\quad{P\left( \alpha_{1} \right)}}} + {\sum\limits_{\alpha_{2}}\quad{{P\left( \alpha_{2} \right)}\ln\quad{P\left( \alpha_{2} \right)}}} + {\sum\limits_{\alpha_{3}}\quad{{P\left( \alpha_{3} \right)}\ln\quad{P\left( \alpha_{3} \right)}}} + {\sum\limits_{\alpha_{4}}\quad{{P\left( \alpha_{4} \right)}\ln\quad{P\left( \alpha_{4} \right)}}}} \right\rbrack} - {T\left\lbrack {{\sum\limits_{\alpha_{be}}\quad{{P\left( \alpha_{be} \right)}\ln\quad{P\left( \alpha_{be} \right)}}} + {\sum\limits_{\alpha_{de}}\quad{{P\left( \alpha_{de} \right)}\ln\quad{P\left( \alpha_{de} \right)}}} + {\sum\limits_{\alpha_{fe}}\quad{{P\left( \alpha_{fe} \right)}\ln\quad{P\left( \alpha_{fe} \right)}}} + {\sum\limits_{\alpha_{he}}\quad{{P\left( \alpha_{he} \right)}\ln\quad{P\left( \alpha_{he} \right)}}}} \right\rbrack} + {T{\sum\limits_{\alpha_{e}}\quad{{P\left( \alpha_{e} \right)}\ln\quad{P\left( \alpha_{e} \right)}}}} + {\gamma_{1}\left\lbrack {1 - {\sum\limits_{\alpha_{1}}\quad{P\left( \alpha_{1} \right)}}} \right\rbrack} + {\gamma_{2}\left\lbrack {1 - {\sum\limits_{\alpha_{2}}\quad{P\left( \alpha_{2} \right)}}} \right\rbrack} + {\gamma_{3}\left\lbrack {1 - {\sum\limits_{\alpha_{3}}\quad{P\left( \alpha_{3} \right)}}} \right\rbrack} + {\gamma_{4}\left\lbrack {1 - {\sum\limits_{\alpha_{4}}\quad{P\left( \alpha_{4} \right)}}} \right\rbrack} + {\sum\limits_{\alpha_{be}}\quad{{\lambda_{1\backslash{be}}\left( \alpha_{be} \right)}\left\lbrack {{P\left( \alpha_{be} \right)} - {\sum\limits_{\alpha_{1\backslash{be}}}\quad{P\left( \alpha_{1} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{be}}\quad{{\lambda_{2\backslash{be}}\left( \alpha_{be} \right)}\left\lbrack {{P\left( \alpha_{be} \right)} - {\sum\limits_{\alpha_{1\backslash{be}}}\quad{P\left( \alpha_{2} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{de}}\quad{{\lambda_{1\backslash{de}}\left( \alpha_{de} \right)}\left\lbrack {{P\left( \alpha_{de} \right)} - {\sum\limits_{\alpha_{1\backslash{de}}}\quad{P\left( \alpha_{1} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{de}}\quad{{\lambda_{3\backslash{de}}\left( \alpha_{de} \right)}\left\lbrack {{P\left( \alpha_{de} \right)} - {\sum\limits_{\alpha_{3\backslash{de}}}\quad{P\left( \alpha_{3} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{fe}}\quad{{\lambda_{2\backslash{fe}}\left( \alpha_{fe} \right)}\left\lbrack {{P\left( \alpha_{fe} \right)} - {\sum\limits_{\alpha_{2\backslash{fe}}}\quad{P\left( \alpha_{2} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{fe}}\quad{{\lambda_{4\backslash{fe}}\left( \alpha_{fe} \right)}\left\lbrack {{P\left( \alpha_{fe} \right)} - {\sum\limits_{\alpha_{4\backslash{fe}}}\quad{P\left( \alpha_{4} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{he}}\quad{{\lambda_{3\backslash{he}}\left( \alpha_{he} \right)}\left\lbrack {{P\left( \alpha_{he} \right)} - {\sum\limits_{\alpha_{3\backslash{he}}}\quad{P\left( \alpha_{3} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{he}}\quad{{\lambda_{4\backslash{he}}\left( \alpha_{he} \right)}\left\lbrack {{P\left( \alpha_{he} \right)} - {\sum\limits_{\alpha_{4\backslash{he}}}\quad{P\left( \alpha_{4} \right)}}} \right\rbrack}} + {\sum\limits_{\alpha_{e}}\quad{{{\lambda_{{be}\backslash e}\left( \alpha_{e} \right)}\left\lbrack {{P\left( \alpha_{e} \right)} - {\sum\limits_{\alpha_{{be}\backslash e}}\quad{P\left( \alpha_{be} \right)}}} \right\rbrack}.}}}} & \lbrack 100\rbrack\end{matrix}$

When we differentiate the Kikuchi free energy with respect to thesuper-node region probabilities, we obtain:P(α₁)=k ₁{tilde over (ψ)}₁(α₁)λ_(1\de)(α_(be))λ_(1\de)(α_(de))  [101]P(α₂)=k ₂{tilde over (ψ)}₂(α₂)λ_(2\be)(α_(be))λ_(2\fe)(α_(fe))  [102]P(α₃)=k ₃{tilde over (ψ)}₃(α₃)λ_(3\de)(α_(de))λ_(3\he)(α_(he))  [103]P(α₄)=k ₄{tilde over (ψ)}₄(α₄)λ_(4\fe)(α_(fe))λ_(4\he)(α_(he))  [104]

Note that these equations are essentially identical to those that wouldbe obtained from the naïve loopy belief propagation method on thesuper-node graph. For example, using the naïve loopy belief propagationmethod, the marginal probability at super-node 1 would be:P(α₁)=k{tilde over (ψ)} ₁(α₁)M ₁ ²(α₁)M ₁ ³(α₁),  [105]where M₁ ²(α₁) is the message from source super-node 2 to destinationsuper-node 1. At first sight, there is a difference with equation [101]because M₁ ²(α₁) seems to depend on the state of all four nodes (abde).In fact, when we take the compatibility matrices into account, we findsthat in the naïve loopy belief propagation method, the messages betweensuper-nodes only depend on the states of the nodes that they share. Thuswe can make the identifications M₁ ²(α_(be))=λ_(1\be)(α_(be)), M₁³(α_(de))=λ_(1\de)(α_(de)), M₂ ¹(α_(be))=λ_(2\be)(α_(be)), M₂⁴(α_(fe))=λ_(2\fe)(α_(fe)), M₃ ¹(α_(de))=λ_(3\de)(α_(de)), M₃⁴(α_(he))=λ_(3\he)(α_(he)), M₄ ²(α_(fe))=λ_(4\fe)(α_(fe)), and M₄³(α_(he))=λ_(4\he)(α_(he)).

In FIG. 15 b, we draw graphically these messages passed in the naïveloopy belief propagation method on the super-node graph.

When we substitute the region probabilities from equations [101-104]into the constraint equations [95-98], we obtain: $\begin{matrix}{{k_{1}{\sum\limits_{\alpha_{1/{be}}}\quad{{{\overset{\sim}{\psi}}_{1}\left( \alpha_{1} \right)}{\lambda_{1\backslash{be}}\left( \alpha_{be} \right)}{\lambda_{1\backslash{de}}\left( \alpha_{de} \right)}}}} = {k_{2}{\sum\limits_{\alpha_{2/{be}}}{{{\overset{\sim}{\psi}}_{2}\left( \alpha_{2} \right)}{\lambda_{2\backslash{be}}\left( \alpha_{be} \right)}{\lambda_{2\backslash{fe}}\left( \alpha_{fe} \right)}}}}} & \lbrack 106\rbrack \\{{k_{1}{\sum\limits_{\alpha_{1/{de}}}\quad{{{\overset{\sim}{\psi}}_{1}\left( \alpha_{1} \right)}{\lambda_{1\backslash{be}}\left( \alpha_{be} \right)}{\lambda_{1\backslash{de}}\left( \alpha_{de} \right)}}}} = {k_{3}{\sum\limits_{\alpha_{3/{de}}}{{{\overset{\sim}{\psi}}_{3}\left( \alpha_{3} \right)}{\lambda_{3\backslash{de}}\left( \alpha_{de} \right)}{\lambda_{3\backslash{he}}\left( \alpha_{he} \right)}}}}} & \lbrack 107\rbrack \\{{k_{2}{\sum\limits_{\alpha_{2/{fe}}}\quad{{{\overset{\sim}{\psi}}_{2}\left( \alpha_{2} \right)}{\lambda_{2\backslash{be}}\left( \alpha_{be} \right)}{\lambda_{2\backslash{fe}}\left( \alpha_{fe} \right)}}}} = {k_{4}{\sum\limits_{\alpha_{4/{fe}}}{{{\overset{\sim}{\psi}}_{4}\left( \alpha_{4} \right)}{\lambda_{4\backslash{fe}}\left( \alpha_{fe} \right)}{\lambda_{4\backslash{he}}\left( \alpha_{he} \right)}}}}} & \lbrack 108\rbrack \\{{k_{3}{\sum\limits_{\alpha_{3/{he}}}\quad{{{\overset{\sim}{\psi}}_{3}\left( \alpha_{3} \right)}{\lambda_{3\backslash{de}}\left( \alpha_{de} \right)}{\lambda_{3\backslash{he}}\left( \alpha_{he} \right)}}}} = {\quad\quad{\quad\quad{k_{4}{\sum\limits_{\alpha_{4/{he}}}{{{\overset{\sim}{\psi}}_{4}\left( \alpha_{4} \right)}{\lambda_{4\backslash{fe}}\left( \alpha_{fe} \right)}{{\lambda_{4\backslash{he}}\left( \alpha_{he} \right)}.}}}}}}} & \lbrack 109\rbrack\end{matrix}$

These equations will be satisfied when we make the identificationbetween the Lagrange multipliers λ and the loopy belief propagationmessages indicated previously, and require that the messages obey thestandard propagation equations.

Take for example, equation [106]. We can simplify this as follows:$\begin{matrix}{{k_{1}{M_{1}^{2}\left( \alpha_{be} \right)}{\sum\limits_{\alpha_{1\backslash{be}}}\quad{{{\overset{\sim}{\psi}}_{1}\left( \alpha_{1} \right)}{M_{1}^{3}\left( \alpha_{de} \right)}}}} = {k_{2}{M_{2}^{1}\left( \alpha_{be} \right)}{\sum\limits_{\alpha_{2\backslash{be}}}\quad{{{\overset{\sim}{\psi}}_{2}\left( \alpha_{2} \right)}{{M_{2}^{4}\left( \alpha_{fe} \right)}.}}}}} & \lbrack 110\rbrack\end{matrix}$

But if we satisfy the standard belief propagation equations on thesuper-node graph $\begin{matrix}{{M_{1}^{2}\left( \alpha_{be} \right)} = {k{\sum\limits_{{\alpha 2}\backslash{be}}{{{\overset{\sim}{\psi}}_{2}\left( \alpha_{2} \right)}{M_{2}^{4}\left( \alpha_{fe} \right)}}}}} & \lbrack 111\rbrack \\{{{M_{2}^{1}\left( \alpha_{be} \right)} = {k{\sum\limits_{{\alpha 1}\backslash{be}}{{{\overset{\sim}{\psi}}_{1}\left( \alpha_{1} \right)}{M_{1}^{3}\left( \alpha_{de} \right)}}}}},} & \lbrack 112\rbrack\end{matrix}$then equation [110] will be satisfied. Thus equations [106-109] areequivalent to the naïve loopy belief propagation equations for thesuper-node network.

However, there exist additional equations that need to be considered inthe Kikuchi formalism that give rise to a modification of the naïveloopy belief propagation method. In particular, when we differentiatethe Kikuchi approximation to the Gibbs free energy with respect toP(α_(de)), P(α_(fc)), P(α_(he)), P(α_(be)), and P(α_(e)), we obtain theequations:P(α_(de))=k _(de)λ_(1\de)(α_(de))λ_(3\de)(α_(de))  [113]P(α_(fe))=k _(fe)λ_(2\fe)(α_(fe))λ_(4\fe)(β_(fe))  [114]P(α_(he))=k _(he)λ_(3\he)(α_(he))λ_(4\he)(α_(he))  [115]P(α_(be))=k _(be)λ_(1\be)(α_(be))λ_(2\be)(α_(be))/λ_(be\e)(α_(e))  [116]P(α_(e))=k _(e)/λ_(be\e)(α_(e))  [117]

The first three of these equations are satisfied automatically when weidentify the λ's 1% with belief propagation messages as before. Forexample, for equation [113], we have: $\begin{matrix}\begin{matrix}{{P\left( \alpha_{de} \right)} = {\sum\limits_{1\backslash{de}}{P\left( \alpha_{1} \right)}}} \\{= {k_{1}{\sum\limits_{1\backslash{de}}{{{\overset{\sim}{\psi}}_{1}\left( \alpha_{1} \right)}{M_{1}^{2}\left( \alpha_{be} \right)}{M_{1}^{3}\left( \alpha_{de} \right)}}}}} \\{= {k_{1}{M_{1}^{3}\left( \alpha_{de} \right)}{\sum\limits_{1\backslash{de}}{{{\overset{\sim}{\psi}}_{1}\left( \alpha_{1} \right)}{M_{1}^{2}\left( \alpha_{be} \right)}}}}} \\{= {{k_{1}{M_{1}^{3}\left( \alpha_{de} \right)}{M_{3}^{1}\left( \alpha_{de} \right)}} = {k_{1}{\lambda_{1\backslash{de}}\left( \alpha_{de} \right)}{\lambda_{3\backslash{de}}\left( \alpha_{de} \right)}}}}\end{matrix} & \lbrack 118\rbrack\end{matrix}$

However, equations [116-117] are not necessarily satisfied when we takethe λ's to have the same values as the belief propagation messages.There is another condition, which can be obtained from the constraintequation${P\left( \alpha_{e} \right)} = {\sum\limits_{\alpha_{b}}\quad{{P\left( \alpha_{be} \right)}.}}$Using this equation, we find: $\begin{matrix}{{{\sum\limits_{\alpha_{b}}{{\lambda_{1\backslash{be}}\left( \alpha_{be} \right)}{\lambda_{2\backslash{be}}\left( \alpha_{be} \right)}}} = k},} & \lbrack 119\rbrack\end{matrix}$where k is some constant. As usual, the actual value of the constant isnot important; what is important in this case is the fact that this sumshould not depend on the state of node e. We can guarantee that all thederivative equations [106-109] and [113-119] are satisfied if weidentify the λ's with loopy belief propagation messages, but make surethat they satisfy the additional constraint of equation [119].

We now define a “normalization” operator. In FIGS. 15 c and 15 d, thenormalization operator is indicated by ellipses 1530, 1541-1544. Thenormalization operator takes as input the messages M₁ ²(α_(be)) and M₂¹(α_(be)) and returns as output normalized messages Norm(M₁ ²(α_(be)))and Norm(M₂ ¹(α_(be))) defined by: $\begin{matrix}{{{Norm}\left( {M_{1}^{2}\left( \alpha_{be} \right)} \right)} = \frac{M_{1}^{2}\left( \alpha_{be} \right)}{\sqrt{\sum\limits_{\alpha_{b}}{{M_{1}^{2}\left( \alpha_{be} \right)}{M_{2}^{1}\left( \alpha_{be} \right)}}}}} & \lbrack 120\rbrack \\{{{Norm}\left( {M_{2}^{1}\left( \alpha_{be} \right)} \right)} = {\frac{M_{2}^{1}\left( \alpha_{be} \right)}{\sqrt{\sum\limits_{\alpha_{b}}{{M_{1}^{2}\left( \alpha_{be} \right)}{M_{2}^{1}\left( \alpha_{be} \right)}}}}.}} & \lbrack 121\rbrack\end{matrix}$

By construction: $\begin{matrix}{{{\sum\limits_{\alpha_{b}}{{{Norm}\left( {M_{1}^{2}\left( \alpha_{be} \right)} \right)}{{Norm}\left( {M_{2}^{1}\left( \alpha_{be} \right)} \right)}}} = 1},} & \lbrack 122\rbrack\end{matrix}$so if we identify the λ's with normalized belief propagation messages,equation [119] will be satisfied. The normalization constraint [119] canbe satisfied simultaneously with the ordinary belief propagationconstraints implicit in equations [106-109], if we follow the modifiedloopy junction tree method (see FIG. 15 c). For this example network,that means that we iteratively solve the belief-propagation equationsfor the super-node lattice, taking care at every iteration to apply thenormalization operator to messages M₁ ²(α_(be)) and M₂ ¹(α_(be)).

Applying this method to our example network yields results identical tothose obtained from the canonical method, as expected. The results aresignificantly more accurate than those obtained from the “naïve” loopysuper-node method.

The Modified Loopy Super-Node Method for a General Markov Network

We have described the modified loopy super-node method for our examplenetwork. We now describe how to use this method for an arbitrary networkwith arbitrary groupings of nodes into super-nodes. A general version ofthe method includes the following steps as shown in FIG. 16.

In step 1601, group all of the nodes of the network 1600 into a set ofintersecting clusters 1650. As stated above, the grouping can beheterogeneous and arbitrary. As an advantage of the arbitrary clusteringaccording to the invention, the user can trade accuracy forcomputational complexity. Larger clusters give better results, but takemore time to resolve. Groups of nodes that are less relevant can begrouped into smaller clusters.

In step 1602, determine a minimal number of marginalization constraints1651 that need to be satisfied between the clusters 1650.

In step 1603, construct a super-node network 1652. In the network 1652,each cluster of nodes is represented by just one super-node 1653.Super-nodes that share one of the marginalization constraints determinedin step 1602 are connected by a super-link 1654.

In step 1604, the super-node network 1652 is searched to locate closedloops of super-nodes that contain at least one common node. For eachsuch closed loop, determine a normalization operator 1655.

In step 1605, update the messages between super-nodes using standardbelief propagation.

In step 1606, replace the messages by their normalized values using thecorresponding normalization operator, and repeat step 1605 untilconvergence in step 1607.

If we have a fixed point of the modified loopy super-node method, thenit is equivalent to the result obtained from minimizing the Kikuchi freeenergy.

Application to Decoding Error-Correcting Codes

An “error-correcting” code is an encoding of digital, e.g., binary,messages. The code adds redundancy that is designed to protect thetransmission of the messages from corruption by noise. Typically, thesender of a message transmits “blocks” of bits (binary digits), whereeach block contains the “data bits” of the original message, plus someadditional “check bits” which help to decode the message if it arrivesat a receiver in a corrupted form. For example, a check bit may encodethe parity of the sum of selected data bits.

Markov networks have been widely used to represent error-correctingcodes, see “Good Error-Correcting Codes based on Very Sparse Matrices”,by D. MacKay, IEEE Transactions on Information Theory, March, 1999. Inthe Markov network formulation of an error-correcting code, some of thenodes of the Markov network will correspond to the data bits and thecheck bits, and the links between the nodes will enforce the appropriaterelationships between data bits and check bits.

The standard decoding algorithm used for some error-correcting codes isthe loopy belief propagation method, using either marginal or MAPprobabilities, run on the Markov network that corresponds to theerror-correcting code, see R. McEliece, D. MacKay, and J. Cheng, IEEE J.Selected Areas in Communication, 1997. When the MAP version of beliefpropagation is used, the decoded states of the bits are just the MAPstates of the corresponding nodes. When the marginal probability versionof belief propagation is used, the decoded states of the bits are takento be those states of the corresponding nodes that have the highestmarginal probability.

In decoding according to our invention, the loopy belief propagationmethod is replaced by the method according to the invention. Our methodprovides improved decoding for error-correcting codes.

Other Applications

The belief propagation method as described herein can be used in anumber of applications. It can be used in computer vision problems wherescenes need to be estimated, see U.S. patent application Ser. No.09/203,108, “Estimating Scenes Using Statistical Properties of Imagesand Scenes,” filed by Freeman et al. on Nov. 30, 1998, also see U.S.patent application Ser. No. 09/236,839, “Estimating Targets UsingStatistical Properties of Observations of Known Targets,” filed byFreeman et al. on Jan. 25, 1999. It can also be used in objectrecognition systems such as the air-to-ground targeting system describedin U.S. Pat. No. 5,963,653, “Hierarchical information fusion objectrecognition system and method,” issued to McNary et al. on Oct. 5, 1999.A number of other applications that are well suited for the inventionare mentioned in U.S. Pat. No. 5,802,256, “Generating improved beliefnetworks,” issued to Heckerman, et al. on Sep. 1, 1998, for example,crop production estimation, program debugging, coastal oceanenvironments prediction, diagnosing linear lightwave networks. Speechrecognition applications that can use the present invention aredescribed in U.S. Pat. No. 5,623,609, “Computer system andcomputer-implemented process for phonology-based automatic speechrecognition,” issued to Kaye et al. on Apr. 22, 1997.

In this description of the invention, we used specific terms andexamples. It is to be understood that various other adaptations andmodifications may be made within the spirit and scope of the invention.Therefore, it is the object of the appended claims to cover all suchvariations and modifications as come within the true spirit and scope ofthe invention.

1. A computer implemented method for determining probabilities of statesof a system represented by a model including a plurality of nodesconnected by links, each node representing possible states of acorresponding part of the system, and each link representing statisticaldependencies between possible states of related nodes, comprising:grouping the plurality of nodes into arbitrary-sized clusters such thatevery node is included in at least one cluster and each link iscompletely contained in at least one cluster; identifying nodes inintersections of clusters, and intersections of intersections ofclusters as regions of nodes; defining messages based on the regions ofnodes, each message having associated sets of source nodes anddestination nodes and a value and a rule depending on other messages andselected links connecting the source nodes and destination nodes;assigning initial values to the messages; updating the value of eachmessage using the associated rule; and determining approximateprobabilities of the states of the system from the messages when atermination condition is reached.
 2. The method of claim 1 wherein thenetwork has pair-wise statistical dependencies between nodes, and theoverall probability of a particular assignment of states s at the nodesis:${{P\left( {s_{1},s_{2},{\ldots\quad s_{N}}} \right)} = {\frac{1}{Z}{\prod\limits_{i,j}\quad{{\phi_{ij}\left( {s_{i},s_{j}} \right)}{\prod\limits_{i}\quad{\psi\left( s_{i} \right)}}}}}},$where the first product runs over all linked neighboring nodes, i and j,and wherein a φ compatibility matrix represents the statisticaldependencies between the possible states s of the related nodes, and theψ function for each node represents evidence that a particular node isin a particular state, and Z is a normalization constant to insure thatthe sum of the probabilities of all possible states of the system isequal to one.
 3. The method of claim 1 wherein the initial values of themessages are random positive numbers.
 4. The method of claim 1 whereinthe initial values of the messages are all ones.
 5. The method of claim1 wherein the termination condition is a convergence of theprobabilities of the states of the system to a predetermined precision.6. The method of claim 1 wherein the approximate probabilities aremarginal probabilities.
 7. The method of claim 1 wherein the approximateprobabilities are maximum a posteriori probabilities.
 8. The method ofclaim 1 wherein the states are discrete.
 9. The method of claim 1wherein the states are continuous.
 10. The method of claim 1 wherein thenetwork model includes closed loops.
 11. The method of claim 1 whereinthe nodes are arranged in a square lattice.
 12. The method of claim 1wherein the nodes are arranged in a triangular lattice.
 13. The methodof claim 1 wherein the nodes and links are a Markov networkrepresentation of an error-correcting code.
 14. A computer implementedmethod for determining probabilities of states of a system representedby a model including a plurality of nodes connected by links, each noderepresenting possible states of a corresponding part of the system, andeach link representing statistical dependencies between possible statesof neighboring nodes, comprising: grouping the plurality of nodes intoarbitrary-sized clusters such that every node is included in at leastone cluster, and each link is completely contained in at least onecluster; identifying nodes in intersecting clusters, and intersectionsof intersecting clusters as regions, and intersections of regions assub-regions; discarding duplicate regions and sub-regions; arranging theregions and sub-regions in a top-to-bottom hierarchy of intersections;defining messages between regions and direct sub-regions directlyconnected in the hierarchy, each message having associated sets ofsource nodes and destination nodes and a value and a rule depending onother messages and selected links connecting the source nodes anddestination nodes, the destination nodes being those nodes in thesub-region, and the source nodes being those nodes in the region andoutside the sub-region; assigning initial values to the messages;updating the value of each message using the associated rule until atermination condition is reached; determining approximate probabilitiesof the states of the system from the messages when a terminationcondition is reached.